The bread and butter of bioinformatics is R, a programming language and statistical computing environment. This tutorial, written for the MIND Data Science Lab at Massachusetts General Hospital, will prepare you to be up and running with R.
How to install R:
First, download the latest version of base R (at the time of writing, v4.2.2) from The Comprehensive R Archive Network (CRAN). See the CRAN FAQs for additional information.
Next, install RStudio, an integrated development environment (IDE) for R, at RStudio (download the open source edition).
I also recommend that you install VS Code at Visual Studio Code and Git at Installing Git. VS Code is a general-purpose IDE, while Git is version control software that you may need later.
Next, let’s install some packages, which provide additional functions for more complex analyses (Python calls these libraries). To install them, follow the instructions at How to Install a Package in R, Method #2. Namely, run the following line at the R console in RStudio:
install.packages(c("data.table", "magrittr", "purrr", "ggplot2", "rmarkdown"))
This will install five important packages (and their dependencies); keep this line handy when you need to install more packages!
In a similar manner, install the latest version of Bioconductor (at the time of writing, v3.16) by following the instructions at Bioconductor - Install. The Bioconductor project is a repository of R packages for bioinformatics analyses.
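For reference, the Bioconductor installation usually comes down to a few lines at the R console. This is a sketch that assumes R 4.2.x (the release paired with Bioconductor 3.16) and a working internet connection. The package name limma is only an illustration, not a lab requirement; the Bioconductor - Install page remains the authoritative source.

```r
# Install the BiocManager helper package from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install (or update to) the Bioconductor release mentioned above
BiocManager::install(version = "3.16")

# Individual Bioconductor packages are then installed by name;
# "limma" is an illustrative choice only
BiocManager::install("limma")
```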
Now it's time to start learning R. Some introductory resources are listed below:
First, briefly review lectures 0-1 of Harvard's CS50 course, publicly available at CS50. Don't focus on Scratch/C syntax; rather, try to familiarize yourself with the ideas of variables, functions, arguments, conditionals (i.e., if and if-else statements), and loops. Best to start with the lecture notes and Cmd/Ctrl + F these terms.
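To see how those CS50 ideas carry over to R, here is a minimal base R sketch. The names threshold, classify_score, scores, and labels are made up for illustration and do not come from the course.

```r
# A variable: a name bound to a value
threshold <- 50

# A function with two arguments; `cutoff` has a default value
classify_score <- function(score, cutoff = 50) {
  # A conditional: pick a branch depending on the comparison
  if (score >= cutoff) {
    "high"
  } else {
    "low"
  }
}

scores <- c(12, 50, 87)
labels <- character(length(scores))  # pre-allocate an empty character vector

# A loop: visit each position in `scores` in turn
for (i in seq_along(scores)) {
  labels[i] <- classify_score(scores[i], cutoff = threshold)
}

print(labels)
```

In R, the last expression evaluated inside a function is its return value, which is why classify_score needs no explicit return().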
Big Book of R: a master resource of approx. 250 books on anything and everything R-related. Bookmark this page!
R for Data Science: skim chapters 2-8 of this intro to R book (authored by the one-and-only Hadley Wickham).
If you prefer a video introduction, check out Getting Started with R.
Manipulating tabular data is a key component of data science! Various packages exist to do so in R; our lab is partial to data.table. Start with the following vignette: Introduction to data.table.
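As a quick preview of what the vignette covers, the sketch below shows the general dt[i, j, by] form on a small table. The data and the column names (gene, group, expr) are hypothetical and serve only to illustrate the syntax.

```r
library(data.table)

# Hypothetical toy data: expression values for three genes in two groups
dt <- data.table(
  gene  = c("A", "B", "C", "A", "B", "C"),
  group = c("ctrl", "ctrl", "ctrl", "case", "case", "case"),
  expr  = c(1.0, 2.0, 3.0, 2.0, 2.0, 6.0)
)

# General form is dt[i, j, by]: filter rows with i, compute with j, group with by
high  <- dt[expr > 1.5]                                # i: keep rows where expr exceeds 1.5
means <- dt[, .(mean_expr = mean(expr)), by = group]   # j + by: mean expression per group
dt[, log_expr := log2(expr)]                           # := adds a column by reference

print(means)
```

Note that := modifies dt in place rather than returning a copy, which is one reason data.table handles large tables efficiently.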
After mastering R, you might find learning Python fundamentals useful as well. A taste of what's to come!
Other pieces of advice which you may find useful:
When possible, write code in notebooks! This allows you to keep your code and documentation in one place. In R, use R Markdown Notebooks (.Rmd files); in Python, use Jupyter Notebooks (.ipynb files). Watch this one-minute video: What is R Markdown?. An R Markdown cheat sheet is available here.
Be sure to comment your code frequently. You'll thank yourself later when you revisit and revise old scripts. In both R and Python, adding # at the beginning of a line turns that line into a comment.
Use magrittr pipes for more readable code: magrittr and wrapr Pipes in R, an Examination | R-bloggers.
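A small sketch of why pipes help readability: the same computation is written once as nested calls and once with the magrittr %>% operator. The vector x is an arbitrary example.

```r
library(magrittr)

x <- c(4, 9, 16)

# Nested calls have to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With %>% the steps read left to right: the left-hand side
# becomes the first argument of the next call
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

print(piped)
```

Both versions compute the same value; the piped form simply lists the steps in the order they happen.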
The purrr::map() functions (i.e., map functions in the purrr package) are an advanced alternative to for loops and support a functional programming style. To learn more, read Ch. 21 of R for Data Science at 21 Iteration | R for Data Science, or reference this Towards Data Science post.
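For comparison, here is the same task written with a for loop and with purrr. This is only a sketch; the list samples is hypothetical.

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for-loop version: pre-allocate the result, then fill it in
means_loop <- numeric(length(samples))
for (i in seq_along(samples)) {
  means_loop[i] <- mean(samples[[i]])
}

# map() applies a function to each element and always returns a list
means_list <- map(samples, mean)

# map_dbl() does the same but returns a numeric vector (names are kept)
means_dbl <- map_dbl(samples, mean)

print(means_dbl)
```

The typed variants such as map_dbl() and map_chr() raise an error if a result does not have the expected type, which catches mistakes that a hand-written loop might let through.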
Useful extensions to install in VS Code: Python, R, Jupyter, Git Graph, Microsoft Remote Development.
Questions? Try Stack Overflow, a Q&A forum for programmers everywhere: Newest ‘r’ Questions - Stack Overflow. Other domain-specific forums exist as well. The internet is a gold mine of programming resources!
Should you have any questions or comments, please don’t hesitate to reach out to me at anoori1@mgh.harvard.edu.