What is R? Required software.

R (https://www.r-project.org/) is a “free software environment for statistical computing and graphics.” For this class, I recommend installing RStudio, an Integrated Development Environment (IDE) for R. It is available for free here (Desktop version): https://rstudio.com/products/rstudio/

The RStudio interface comes with 4 panels:

Screenshot of RStudio

First steps in R.

In the R console, you can type commands and execute them by hitting “ENTER” on your keyboard. Let’s start with a very simple command: adding two numbers.
Type this in the R console (and execute the command by pressing Enter):

2+2

You should see the result (4) appear on the next line in the prompt (the prompt is the area in the R console where you can type commands).

To keep an archive of the commands you type, we will save the commands in an R source file. Create a new R script in RStudio (File -> New File -> R script) and type your R code in it. One strength of RStudio is that it allows you to execute commands directly from the source file, which is extremely convenient (it allows you to keep your commands saved in the source file, without having to copy-paste the line from the R console into your source file). Simply select the code you want to run, then hit CTRL+ENTER to execute the code. If no code is selected, the current line (where the cursor is) is executed when you press CTRL+ENTER.

Variables.

All programming languages rely on variables to store data. Here is an example of how to use variables in R:

a = 2
b = 4
a + b
## [1] 6

This will assign the value 2 to the variable a and the value 4 to the variable b.

There is another notation available in R to assign values to variables. You can use “<-” instead of “=”. For the purpose of this class, we will consider the two operators as being equivalent, so feel free to use whichever you prefer. In other words, the code above is equivalent to this code below:

a <- 2
b <- 4
a + b
## [1] 6

Variables can store more than just a number. In R, a common type of variable is a vector. Here is an example of a vector containing 3 elements:

myVector = c(10, 32, 18)
myVector
## [1] 10 32 18

You can access the ith element in a vector with the notation vector[i]. For example, to access the second element in myVector:

myVector[2]
## [1] 32

Variables can be used to store more complexes structures, including 2-dimensional arrays (data.frame - we will work with these types of variables a lot in this class). More on this later.

Rules for writing Identifiers in R:

  1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
  2. They must start with a letter or a period. Identifiers that start with a period cannot be followed by a digit.
  3. Reserved words in R cannot be used as identifiers.

Reserved words are words that have a special meaning in R and therefore cannot be used to name variables. Here is a list of reserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA.

Saving your code.

At this point, it is a good idea to type your code in a “source code” file rather than directly in the prompt. This will allow you to save all the code that you write. Take advantage of the functionalities of RStudio to execute your code directly from the source code editor…

Functions.

A function is a program fragment that can perform a specific task. R comes with a number of functions ready to use, but you can also write your own functions (and you will have to write quite a few functions for ths class…). Functions usually take in one (or several) argument(s) (= variables) and return a result. Here is a basic example of function provided by R: sum

sum(myVector)
## [1] 60

Here, we passed the vector “myVector” to the function “sum” which calculated the sum of all the elements in the vector and returns the results (in this case: 10+32+18 = 60).

Obtaining help in R.

You can obtain help on functins provided by R by typing: ?[name of the function]. For example, type ?sum to print the help for the sum function.
Obviously, all this information is also available online.

Loops.

One of the most useful tools in programming is called a loop. A loop consists in repeating a certain action multiple times, as long as a certain condition is satisfied.
For example, if you want to compute the sum of all numbers between 1 and 10. You could simply write:

1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
## [1] 55

However, doing the same for all the numbers between 1 and 1,000 would take a lot of time. This type of operation can be done very easily with a loop. Here is the code (explanations below):

i = 1
total= 0
while(i <= 10){
  total = total + i
  i = i + 1
}
total
## [1] 55

This code uses two variables: i and total. The variable i is used to keep track of how many iterations of the loop have been done and the variable total keeps track of the sum of integers being done.
This type of loop is called a while loop. More information about while loops here: https://www.datamentor.io/r-programming/while-loop/

An alternative would be to use a for loop, which has this syntax:

total = 0
for( i in (1:10)){
  total = total + i
}
total
## [1] 55

For loops are also very convenient to enumerate the elements of a vector. See for example:

for(element in myVector){
  print(element)
}
## [1] 10
## [1] 32
## [1] 18

More information about for loops in R: https://www.datamentor.io/r-programming/for-loop/

Writing your own functions.

Let’s write a simple function which computes the sum of all integers between a lower and an upper bound. Here is the source code for such a function:

sumBetween <- function(lowerBound, upperBound){
  total = 0
  for(i in (lowerBound:upperBound)){
    total = total + i
  }
  total
}

And now the result from calling the function to compute the sum of integers between 1 and 10:

sumBetween(1, 10)
## [1] 55

Now that you have created this function, you can easily compute the sum of integers between any 2 bounds. For example:

sumBetween(22, 81)
## [1] 3090

Conditional tests.

In programming, a conditional (or conditional statement/expression) allows you to perform different operations depending on the result of a given test.

Conditional statement flow chart

The most commonly used conditional statement is the if statement. Here is an example of using an if statement inside a function:

isMoreThan10 <- function(myNumber){
  if(myNumber > 10){
    cat(myNumber, " is larger than 10.\n")
  } else {
    cat(myNumber, " is NOT larger than 10\n")
  }
}

isMoreThan10(9)
## 9  is NOT larger than 10
isMoreThan10(10)
## 10  is NOT larger than 10
isMoreThan10(11)
## 11  is larger than 10.
for(i in (8:12)){
  isMoreThan10(i)
}
## 8  is NOT larger than 10
## 9  is NOT larger than 10
## 10  is NOT larger than 10
## 11  is larger than 10.
## 12  is larger than 10.

Note the syntax of the function cat which simply prints some text on the prompt.

You can use nested if/else structures, like so:

isMoreThan10 <- function(myNumber){
  if(myNumber > 10){
    cat(myNumber, " is larger than 10.\n")
  } else {
    if( myNumber == 10 ){
      cat(myNumber, " is equal to 10\n")
    } else {
      cat(myNumber, " is NOT larger than 10\n")
    }
  }
}

for(i in (8:12)){
  isMoreThan10(i)
}
## 8  is NOT larger than 10
## 9  is NOT larger than 10
## 10  is equal to 10
## 11  is larger than 10.
## 12  is larger than 10.

Commenting the code.

A crucial aspect of programming is to keep your code organized and make it easy to (1) share your code with other people and (2) modify your code in the future. Therefore, it is important to document all the code that you write.
In R, you can write a comment by placing a # symbol before the text. Here is an example:

# The plot_usmap function plots a map of the US.
plot_usmap(regions = "states")

This specific comment is not very informative. A more useful one could be:

# The plot_usmap function plots a map of the US.
# Substitute regions = "states" by regions = "counties" to display the map with the counties outline.
plot_usmap(regions = "states")

As a general rule, you should use comments to document every program that you write, and especially to describe the role of each function that you write. The goal is that a someone could use the functions that you have written without having to read the code to understand what your function does.

# The function "sumBetween" computes the sum of all integers between "lowerBound" and "upperBound".
sumBetween <- function(lowerBound, upperBound){
  total = 0
  for(i in (lowerBound:upperBound)){
    total = total + i
  }
  total
}

Note: there are tools (such as Roxygen: https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html) that use a specific syntax of comments to automatically generate the type of documentation that you see when you type for example:

?mean

While the use of these tools is beyond the scope of this class, I encourage you to check them out if you ever need to write a large amount of R code for your research.

First application to genomics.

To illustrate how useful coding can be in the field of genomics, we will start with a simple exercise which consists in writing a piece code to compute the number/frequency of each nucleotide (A, T, C and G) in a DNA sequence. While there are already several packages available to perform this task, we will start by writing our own code (innefficient solution) before moving on to using dedicated packages from BioConductor (http://bioconductor.org/)

A first, inefficient solution.

Let’s start with the most basic algorithm that we can use to perform this task:

sequence <- "GCATCGACTAGCTACGATCGACTAGCTACGATCGATCAGCTAC"

# Let's start by converting this string into a vector of characters:
v_sequence <- strsplit(sequence, split = '', fixed = T)[[1]]

seqLength <- length(v_sequence)
# Note, we could have obtained the length from the string vairable with this code: seqLength <- nchar(sequence)

nbA <- 0
nbT <- 0
nbC <- 0
nbG <- 0

for(i in (1:seqLength)){
  if( v_sequence[i] == "A" ){ nbA <- nbA + 1 }
  if( v_sequence[i] == "T" ){ nbT <- nbT + 1 }
  if( v_sequence[i] == "C" ){ nbC <- nbC + 1 }
  if( v_sequence[i] == "G" ){ nbG <- nbG + 1 }
}
cat("A: ", round(nbA/seqLength, 2), "\n",
    "T: ", round(nbT/seqLength, 2), "\n",
    "C: ", round(nbC/seqLength, 2), "\n",
    "G: ", round(nbG/seqLength, 2), "\n",
    sep = ""
    )
## A: 0.28
## T: 0.21
## C: 0.3
## G: 0.21

A slightly better solution:

One thing to notice is that this code performs the same operation on each possible nucleotide (A, T, C and G). As a general rule, it is preferable to write a generic code (which would work on any nucleotide in this case) and apply it to all the conditions (= all the nucleotides here) using a loop.

Illustration:

sequence <- "GCATCGACTAGCTACGATCGACTAGCTACGATCGATCAGCTAC"
v_sequence <- strsplit(sequence, split = '', fixed = T)[[1]]
seqLength <- length(v_sequence)

nucleotides <- c("A", "T", "C", "G")
nucleotides_observations <- rep(0, length(nucleotides))
names(nucleotides_observations) <- nucleotides

for(nucleotide in v_sequence){
  nucleotides_observations[nucleotide] <- nucleotides_observations[nucleotide] + 1
}

for(nucleotide in nucleotides){
  cat(nucleotide, ": ", round(nucleotides_observations[nucleotide]/seqLength, 2), "\n", sep = "")
}
## A: 0.28
## T: 0.21
## C: 0.3
## G: 0.21

Note how the content of a vector can be access by its indice or its name (if it has been given a name)

print(nucleotides_observations)
##  A  T  C  G 
## 12  9 13  9
# Access with indice:
print(nucleotides_observations[3])
##  C 
## 13
# Access with name:
print(nucleotides_observations["C"])
##  C 
## 13

Exercise.

Modify the code to take into the possibility of ‘N’ in the sequence. Notice how modifying the second version of the code is a lot easier…

Taking advantage of R

Many of the basic operations such as counting certain elements in a vector can be done in R without having to perform a loop. We can use functions provided by R to perform the same operations in a much more efficient way (in the background, R will still be performing some sort of loop, but it will be a lot faster).

Here is how we could imporove our code:

sequence <- "GCATCGACTAGCTACGATCGACTAGCTACGATCGATCAGCTAC"
v_sequence <- strsplit(sequence, split = '', fixed = T)[[1]]
seqLength <- length(v_sequence)

nucleotides <- c("A", "T", "C", "G")
nucleotides_observations <- rep(0, length(nucleotides))
names(nucleotides_observations) <- nucleotides

for(nucleotide in nucleotides){
  nucleotides_observations[nucleotide] <- length( which(v_sequence == nucleotide) )
}

for(nucleotide in nucleotides){
  cat(nucleotide, ": ", round(nucleotides_observations[nucleotide]/seqLength, 2), "\n", sep = "")
}
## A: 0.28
## T: 0.21
## C: 0.3
## G: 0.21

Explanations:

# Observe the results from these two lines of code:
nucleotides == "C"
## [1] FALSE FALSE  TRUE FALSE
which( nucleotides == "C" )
## [1] 3
which( nucleotides != "C" )
## [1] 1 2 4

A trully efficient solution (with an R package).

While R comes by default with a large number of ready-to-use functions (mean, median, statistical tests, plot, …) you will eventually encounter the need to write your own functions to analyze data. But odds are that someone else has already done it and published the code in a package. A package is a set of functions (and sometimes also data / structures) that can be installed in R to expand its capacities. For example, you can find packages dedicated to the analysis of gene expression data, geographical data, the production of pretty graphics, …
To avoid re-inventing the wheel, we rely on packages a lot. he procedure to install a new package depends on your configuration, but from R studio you can use the graphical menu: Tools -> Install Packages or simply type:

Here, we will use a package named biostrings.

install.packages(packageName)

Let’s use the package biostrings to read a fasta file and compute the frequencies of each nucleotide for each entry in the fasta file. You can either create a fasta file yourself, or download a very simple file here.

library(Biostrings) # <- This will load the package, making all its functions available.

genome <- readDNAStringSet("genome.fasta")

alphabetFrequency(genome)
##       A  C  G  T M R W S Y K V H D B N - + .
## [1,] 13 14  9  9 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 22 21 15 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
alphabetFrequency(genome, as.prob = T)
##              A         C         G         T M R W S Y K V H D B N - + .
## [1,] 0.2888889 0.3111111 0.2000000 0.2000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0.3188406 0.3043478 0.2173913 0.1594203 0 0 0 0 0 0 0 0 0 0 0 0 0 0
for(chr in names(genome)){
  print(chr)
  print(alphabetFrequency(genome[chr]))
}
## [1] "chr_1"
##       A  C G T M R W S Y K V H D B N - + .
## [1,] 13 14 9 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1] "chr_2"
##       A  C  G  T M R W S Y K V H D B N - + .
## [1,] 22 21 15 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Generating pretty reports: R Markdown.

In this class, you will have to perform a number of analyzes using R and submit a weekly lab report. To keep your R code organized alongside the results of the analyzes, we will use the package R Markdown (this webpage was generated using R Markdown…).

In RStudio, make a new R Markdown file and knit it (File -> Knit document or keyboard shortcut: CTRL+SHIFT+K)

A basic tutorial about RMarkdown is available at: https://rmarkdown.rstudio.com/authoring_quick_tour.html

You can also go to this webpage for more information: https://rmarkdown.rstudio.com/lesson-1.html

Conclusions.

After this first lab, you should be comfortable using variables, functions and the two basic loops: for and while. In addition to these basic programming skills, you should now be able to generate a report using R Markdown to document what you are doing.
The best way to learn (IMO) is to practice. Try to write your own functions for other operations such: computing the median value of a vector, finding the smallest/largest value in a vector, …
There is a lot of information about R online. If you have a question about how to do something in R, odds are that someone has already answered this question on http://stackoverflow.com/

If you feel lost, here is a short tutorial that summarizes the basics of R: https://www.statmethods.net/r-tutorial/index.html

And another tutorial which goes more in depth: https://www.datamentor.io/r-programming/

.