Efficient R programming

In this lab, we will learn some of the basic concepts necessary to write code efficiently in R. There are two main aspects to efficiency: how long it takes to write the code, and how long the code takes to run.

Spending hours optimizing a piece of code so that it runs in 1.2 seconds instead of 1.3 seconds is rarely worth it. However, investing a little bit of time can lead to huge improvements. If you encounter a situation where your code takes many hours or days to run, you might want to revisit this section.

Most of the information presented here comes from this excellent resource: https://csgillespie.github.io/efficientR/. If you use R heavily, I recommend taking the time to read this source in detail.

1. Making use of pre-compiled code and avoiding function calls

When you perform an operation in R, the actual calculations are ultimately performed by pre-compiled code (usually written in C or Fortran).

t <- data.frame(
  c1 = c(1,2,3,4),
  c2 = c(10,20,30,40)
)

t[4,2] # <- This actually calls the function "[" to access elements of the data frame.
## [1] 40

You can see the R code for any function by simply typing the name of this function (in the R console) without any parentheses. For example:

runif
## function (n, min = 0, max = 1) 
## .Call(C_runif, n, min, max)
## <bytecode: 0x000001f1b977ee90>
## <environment: namespace:stats>

All this function does is call the C version of the runif function…

Possibly the most important advice for efficient R programming is to reach this pre-compiled code with as few R-level function calls as possible: one call to the underlying compiled code is a lot more efficient than many calls made one element at a time.

Let’s illustrate this with a simple function that computes the mean value of a vector of numbers. Note that we will use the package microbenchmark to quantify the gain in efficiency.

v <- rnorm(n = 10000, mean = 20, sd = 5)

# A function that computes the average of the numbers in a vector, using a for loop in R.
myMean <- function(v){
  total <- 0
  for(i in (1:length(v))){
    total <- total + v[i]
  }
  avg <- total / length(v)
  avg
}

# Let's compare the speed of execution of this function and the mean function provided by R:
library("microbenchmark")
microbenchmark(myMean(v), mean(v))
## Unit: microseconds
##       expr   min     lq    mean median     uq     max neval
##  myMean(v) 286.9 289.75 614.739  322.3 565.15 16834.3   100
##    mean(v)  21.7  22.00  28.604   22.5  24.10   209.9   100

As you can see, the mean function provided by R is more than an order of magnitude faster than the for loop. That is because, at every iteration, the for loop calls the “+” function, the “[” function, etc.
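A quick way to see where the gain comes from (a sketch, not part of the original benchmark): replacing the R-level loop with a single call to the compiled sum() function already gets you most of the way to the speed of mean().

# Replacing the R-level loop with one call to the compiled sum() function:
myMean_vectorized <- function(v){
  sum(v) / length(v)
}
microbenchmark(myMean(v), myMean_vectorized(v), mean(v))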

Let’s apply this to the very first exercise that we did in this class: writing a function that computes the correlation between two sets of numbers. Below, I am showing two different ways of writing this function and comparing their execution time (as well as the execution time of the base R function cor()):

# Let's start by making two vectors of 1,000 random values:
v1 <- rnorm(n = 1000, mean = 10, sd = 4)
v2 <- rnorm(n = 1000, mean = 10, sd = 4)

# This version uses a for loop and recomputes mean(x) and mean(y) at every iteration: extremely inefficient.
myCor_notEfficient <- function(x,y){
  num <- 0
  denox <- 0
  denoy <- 0
  for(i in (1:length(x))){
    num <- num + ( (x[i]-mean(x))*(y[i]-mean(y)) )
    denox <- denox + (x[i]-mean(x))^2
    denoy <- denoy + (y[i]-mean(y))^2
  }

  res <- num / sqrt(denox*denoy) 
  res
}


# This version is fairly efficient
myCor_efficient <- function(x,y){
  num = mean(x*y) - mean(x)*mean(y)
  denum = sqrt( (mean(x^2)-mean(x)^2) * (mean(y^2)-mean(y)^2)   )
  res = num/denum
  res
}


# Minimizing the number of calls to the function mean by pre-calculating mean(x) and mean(y)
myCor_superEfficient <- function(x,y){
  meanx <- mean(x)
  meany <- mean(y)
  num = mean(x*y) - meanx*meany
  denum = sqrt( (mean(x^2)-meanx^2) * (mean(y^2)-meany^2)   )
  res = num/denum
  res
}


# Making sure that all 3 versions give the correct answer:
cor(v1, v2)
## [1] -0.006671382
myCor_efficient(v1, v2)
## [1] -0.006671382
myCor_superEfficient(v1, v2)
## [1] -0.006671382
myCor_notEfficient(v1, v2)
## [1] -0.006671382
microbenchmark(myCor_notEfficient(v1, v2), myCor_efficient(v1, v2), myCor_superEfficient(v1, v2), cor(v1,v2))
## Unit: microseconds
##                          expr     min       lq      mean   median       uq
##    myCor_notEfficient(v1, v2) 27544.3 29892.25 33752.313 31072.20 36302.55
##       myCor_efficient(v1, v2)    48.0    54.70   169.832    62.00    84.05
##  myCor_superEfficient(v1, v2)    35.8    41.25   145.123    46.95    70.50
##                   cor(v1, v2)    25.3    32.05    61.819    63.05    70.50
##      max neval
##  63803.1   100
##   9421.3   100
##   8779.6   100
##    197.8   100

Note that even the notation used to access elements of a data frame can influence the speed of the operation. In other words, t$column[row] does not rely on exactly the same code as t[row, column]. Illustration:

NB_ROWS <- 10000
t <- data.frame(
  c1 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c2 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c3 = rnorm(n = NB_ROWS, mean = 10, sd = 2)
)

microbenchmark(t[8000,"c2"], t[8000,2], t$c2[8000])
## Unit: microseconds
##           expr  min    lq   mean median    uq   max neval
##  t[8000, "c2"] 16.9 19.50 24.787   20.5 21.00 263.3   100
##     t[8000, 2] 17.1 19.65 34.113   20.5 21.45 830.9   100
##     t$c2[8000]  1.3  1.60  2.110    2.0  2.20   5.7   100

2. Using the apply functions

There is not always an easy way to avoid writing a loop in R. For example, suppose you want to compute, for each row of a data frame, the average of certain columns:

# Let's build a small data frame with 10 rows and 3 columns:
NB_ROWS <- 10
t <- data.frame(
  c1 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c2 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c3 = rnorm(n = NB_ROWS, mean = 10, sd = 2)
)

t
##           c1        c2        c3
## 1  10.718634 10.185278 10.123133
## 2  13.152707  8.980474  8.858445
## 3  10.874582  7.594101  9.871553
## 4   9.798828 11.890276 10.645497
## 5   9.001496  9.336788 13.466145
## 6   8.895084 12.432758  7.250345
## 7  12.218979 10.401939 11.310439
## 8   7.568366 15.089725  8.474312
## 9  10.442795  8.752577 11.831971
## 10  8.192511 10.087636 11.824321

To compute the average of these three columns on a row-by-row basis, you cannot simply write:

t$avg <- mean(t$c1, t$c2, t$c3)

(You can try and see the error message: t$c2 and t$c3 are interpreted as the trim and na.rm arguments of mean(), not as additional data to average…)

The basic solution is to use a for loop and compute the average for each row:

myAvg_v1 <- function(t, cNames){
  t$avg <- NA
  nbRows <- nrow(t)
  nbc <- length(cNames)
  for(i in (1:nbRows)){
    total <- 0
    for(cName in cNames){
      total <- total + t[i,cName]
    }
    t$avg[i] <- total/nbc
  }
  t
}

Which can be optimized by using the apply() function:

myAvg_v2 <- function(t, cNames){
  t$avg <- NA
  t$avg <- apply(t[ , cNames], 1, mean)
  t
}

Let’s compare these two solutions (using a slightly larger data frame):

NB_ROWS <- 1000
t <- data.frame(
  c1 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c2 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c3 = rnorm(n = NB_ROWS, mean = 10, sd = 2)
)

cNames <- c("c1", "c2", "c3")
microbenchmark(myAvg_v1(t, cNames), myAvg_v2(t, cNames))
## Unit: milliseconds
##                 expr     min       lq     mean  median       uq     max neval
##  myAvg_v1(t, cNames) 37.8853 39.54680 44.60947 42.9156 47.72330 71.7375   100
##  myAvg_v2(t, cNames)  6.4818  7.28025  8.81870  7.8488  9.18015 41.3966   100

Note that we could also have made our function a lot more efficient without using apply, by adding whole columns at a time (vectorized addition), so that the loop runs over the columns rather than over the rows:

myAvg_v3 <- function(t, cNames){
  vTot <- t[ , cNames[1]]
  nbNames <- length(cNames)
  for(i in (2:nbNames)){
    vTot <- vTot + t[ , cNames[i]]
  }
  t$avg <- vTot / nbNames
  t
}

cNames <- c("c1", "c2", "c3")
microbenchmark(myAvg_v1(t, cNames), myAvg_v2(t, cNames), myAvg_v3(t, cNames))
## Unit: microseconds
##                 expr     min       lq      mean   median      uq      max neval
##  myAvg_v1(t, cNames) 36669.3 39529.20 50382.827 41882.90 49818.1 152301.9   100
##  myAvg_v2(t, cNames)  6286.5  7239.95  9455.131  8089.45  9688.2  27057.8   100
##  myAvg_v3(t, cNames)    28.3    44.55   115.044    49.75    71.4   4877.2   100

Other functions of the same family include lapply() and sapply(), which apply a function to each element of a list or vector.
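As a quick illustration (not part of the original lab), lapply() returns its results as a list, while sapply() simplifies them to a vector when possible:

# Applying a function to every element of a vector:
squares_as_list   <- lapply(1:5, function(x) x^2) # a list of 5 elements
squares_as_vector <- sapply(1:5, function(x) x^2) # the numeric vector 1 4 9 16 25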

3. Memory allocation

All the variables used in your code are stored in the computer’s memory (the RAM). You can see the amount of memory used by objects in R with the object.size() function:

v1 <- (1:10)
object.size(v1)
## 96 bytes
v2 <- (1:20)
object.size(v2)
## 176 bytes
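As a rough rule of thumb (an aside, not from the original lab), each double-precision number occupies 8 bytes, plus a small fixed overhead per object; you can check this yourself:

# One million doubles take roughly 8 MB of memory (plus a small header):
object.size(numeric(1e6))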

Every time a new block of memory is needed, R has to request it from the operating system. Additionally, for objects such as vectors, the need for contiguous blocks of memory means that a growing vector may have to be copied to a different, larger block. This process can take a lot of time.

Whenever possible, you should pre-allocate the memory needed for a vector before filling it, rather than increasing its size in incremental steps. Let’s illustrate this with two ways of writing a function that fills a vector of N elements with all integers from 1 to N:

# Inefficient version which simply adds a new element at the end of the vector for every iteration
fillVector_inefficient <- function(N){
  v <- c()
  for(i in (1:N)){
    v <- c(v, i) # Here, we add the value of 'i' at the end of vector v. 
    #The growing vector size will require memory allocation calls.
  }
  v
}


fillVector_efficient <- function(N){
  v <- numeric(length = N) # Pre-allocate the vector. By default, all N values are set to zero.
  for(i in (1:N)){
    v[i] <- i # Here, we simply change the value at position 'i' of the vector
    #No memory allocation is taking place.
  }
  v
}

N <- 1000
# Benchmark, with the two functions + the seq function provided by R.
microbenchmark(fillVector_inefficient(N), fillVector_efficient(N), seq(from=1, to=N, by=1))
## Unit: microseconds
##                           expr    min      lq     mean  median      uq     max
##      fillVector_inefficient(N) 1188.0 2026.45 3102.432 2494.35 3115.50 21289.0
##        fillVector_efficient(N)   40.4   42.55   99.931   47.15   71.60  3184.2
##  seq(from = 1, to = N, by = 1)   22.0   28.35   82.238   59.80  111.45   601.0
##  neval
##    100
##    100
##    100
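The same pre-allocation principle applies to other growing objects. For example (a sketch, not from the original lab), a list can be pre-allocated with vector():

# Pre-allocating a list of length N (all elements start as NULL):
fillList_efficient <- function(N){
  l <- vector("list", length = N)
  for(i in (1:N)){
    l[[i]] <- i^2
  }
  l
}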

4. Efficient Input/Output (I/O)

Generally speaking, accessing data in memory is fast, but accessing data on a hard drive is “slow” (solid-state drives are closing the gap, but they are expensive and not yet a viable solution for large storage space).

Therefore, optimizing disk access (i.e., writing or reading data to/from a hard drive) can speed things up considerably (it can also reduce the physical stress on hard drives, prolonging their lifespan).

The two main ways of optimizing I/O are to limit the number of I/O operations (thank you Captain Obvious…) and to use binary file formats instead of plain text. Here, we will benchmark the use of binary file formats in R, using the package rio. Note that you will most likely be prompted to also run install_formats() to take advantage of all the functionalities of the rio package.

# Let's make a (not too) big data frame:
NB_ROWS <- 10000
t <- data.frame(
  c1 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c2 = rnorm(n = NB_ROWS, mean = 10, sd = 2),
  c3 = rnorm(n = NB_ROWS, mean = 10, sd = 2)
)

# Using the base R function "write.table" to export the data into a tabulated text file
myWrite_base <- function(t){
  write.table(t, file = "table-base.tab", sep = "\t", quote = F, row.names = F)
}

# Using saveRDS to write data in a binary format
myWrite_RDS <- function(t){
  saveRDS(t, "table.rds")
}

microbenchmark(myWrite_base(t), myWrite_RDS(t), times = 5)
## Unit: milliseconds
##             expr     min      lq     mean  median      uq      max neval
##  myWrite_base(t) 87.2148 87.6697 98.23888 89.3239 90.0691 136.9169     5
##   myWrite_RDS(t) 14.5884 15.2595 16.75552 15.9562 17.1223  20.8512     5
# You will also notice that the binary file is a lot smaller:
file.size("table-base.tab")/1e3
## [1] 516.694
file.size("table.rds")/1e3
## [1] 224.436
# We can also compare the speed performance of read.table vs readRDS to read data from a file:
myRead_base <- function(fName){
  t <- read.table(file = fName)
  t
}

myRead_RDS <- function(fName){
  t <- readRDS(fName)
  t
}

microbenchmark(myRead_base("table-base.tab"), myRead_RDS("table.rds"), times = 5)
## Unit: milliseconds
##                           expr     min      lq     mean  median      uq     max
##  myRead_base("table-base.tab") 11.1124 12.1749 13.82092 12.7837 16.3505 16.6831
##        myRead_RDS("table.rds")  1.6245  2.0365  2.27078  2.0643  2.2165  3.4121
##  neval
##      5
##      5
# Note that I ask microbenchmark to perform the operation only 5 times to avoid waiting for too long and to not stress the hard drive too much.

The main advantage of plain text files is that you can open them and inspect them using a regular text editor software. This is not possible with a binary file (which will look like a string of strange characters if you try to open it with a text editor).
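As a quick check (a sketch, not part of the original lab), you can display the first lines of the plain-text file directly from R:

# The plain-text file is human-readable; the binary .rds file would not be:
readLines("table-base.tab", n = 2)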

write.table and saveRDS both ship with a standard R installation. In some situations, it can be useful to read/write files in formats defined by other programs. For example, you might want to read/write files in the Microsoft Excel xlsx format. That’s when the rio package becomes extremely useful:

library("rio")
export(t, "test.xlsx")

You can now open the “test.xlsx” file with Excel (or LibreOffice, …).
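rio also provides an import() function that guesses the file format from the extension, so reading the file back into R is just as simple (a minimal sketch):

# Reading the Excel file back into a data frame (format guessed from the ".xlsx" extension):
t2 <- import("test.xlsx")
head(t2)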

5. Efficient data structures


Up to this point, we have relied mostly on data.frame objects to store 2-dimensional tables in memory. While this is perfectly fine, there are alternatives that offer some nice advantages.

In this section, we will mainly take advantage of tibbles, which can be seen as improved data.frames.

Run the following code, and observe the differences in output when you display the object (data.frame or tibble) in the R console.

t <- read.csv("data/2020-12-weather.csv")
t

library(tibble)
library(tidyverse)
tb <- read_csv("data/2020-12-weather.csv", comment = "#")
tb

# Note that you can also convert a data.frame into a tibble:
tb2 <- as_tibble(t)
tb2


Data manipulation is also a lot easier with tibbles (and the tidyverse functions that operate on them) than with plain data.frames:

# Efficient renaming of columns:
tb <- rename(tb, tavg = "temperature_avg")

# Easy selection of rows that fit a given condition:
filter(tb, tavg > 5)

# Note that this last command does not modify the tb object. 
# If you want to save the result in a variable, you would have to do something like this:

hotDays <- filter(tb, tavg > 5)


One application where tibbles are extremely useful is when you need to reshape a wide table into a long one. For example, the following table with temperatures from two weather stations is in a wide format (one column per station):

## # A tibble: 4 × 2
##   starkville columbus
##        <dbl>    <dbl>
## 1         12       13
## 2         10       11
## 3         15       14
## 4         16       17
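The wide table above is assumed to be stored in a tibble named tbl (the name used in the pivot_longer() call further below); a minimal way to create it would be:

# Building the wide table shown above (one column per weather station):
tbl <- tibble(
  starkville = c(12, 10, 15, 16),
  columbus   = c(13, 11, 14, 17)
)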


If you want to plot this data with ggplot, using a different color/shape per station, you need the table to be in this long format:

## # A tibble: 8 × 2
##   station    temperature
##   <chr>            <dbl>
## 1 starkville          12
## 2 columbus            13
## 3 starkville          10
## 4 columbus            11
## 5 starkville          15
## 6 columbus            14
## 7 starkville          16
## 8 columbus            17


Performing this operation would be cumbersome and painful using only base R functions. With tibbles and the pivot_longer() function from tidyr, it is as easy as this one line of code:

tblg <- pivot_longer(tbl, all_of(colnames(tbl)), names_to = "station", values_to = "temperature")

6. Chaining operations with the Pipe operator

The pipe operator in R is written with this symbol: %>%
It passes the result of the expression on its left to the next command, by default as its first argument. For example:

tblg$temperature %>% mean(na.rm = TRUE)
## [1] 13.5
# This provides the content of tblg$temperature as the first argument of the function mean

A more useful example… Let’s select from the tibble tb all the rows for which the average temperature was greater than or equal to 5, less than or equal to 6, and for which the maximum temperature was greater than 10. The elegant solution with the pipe operator is:

library(tidyverse)

tb <- read_csv("data/2020-12-weather.csv", comment = "#")

filter(tb, temperature_avg >= 5) %>% 
  filter(temperature_avg <= 6) %>%
  filter(temperature_max > 10)
## # A tibble: 3 × 9
##   Date       temperatu…¹ tempe…² tempe…³ humid…⁴ humid…⁵ humid…⁶ wind_…⁷ wind_…⁸
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 12/5/2020         17.1     5.9    -1.3    0.92    0.75    0.3      6.5     0.2
## 2 12/15/2020        10.8     5.1    -1.5    0.9     0.74    0.51     7.9     0.3
## 3 12/19/2020        14.1     5.9    -2.4    0.91    0.74    0.37    14       0.7
## # … with abbreviated variable names ¹​temperature_max, ²​temperature_avg,
## #   ³​temperature_low, ⁴​humidity_high, ⁵​humidity_avg, ⁶​humidity_low, ⁷​wind_max,
## #   ⁸​wind_avg
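For comparison (a sketch, not part of the original lab), the same selection without the pipe operator requires either nested function calls or intermediate variables, which is much harder to read:

# Equivalent nested version, without the pipe operator:
filter(filter(filter(tb, temperature_avg >= 5), temperature_avg <= 6), temperature_max > 10)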

Conclusions

There are many ways to obtain the same result in R. After completing this lab, you should now be familiar with some of the main ways to improve the efficiency of your R code. When your code takes too long to run, come back to these principles and see if you can apply any of the optimization tricks presented in this lab. You can also use the package profvis to easily find (and visualize) which parts of your code take the most time (and therefore need to be optimized, or removed!)
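For example, here is a minimal sketch of profiling one of the functions from this lab with profvis (assuming the package is installed; t is rebuilt here because it was overwritten by the weather data above):

# install.packages("profvis")  # if needed
library(profvis)

# Rebuild a numeric data frame and profile the slow row-by-row average function from section 2:
t <- data.frame(
  c1 = rnorm(n = 10000, mean = 10, sd = 2),
  c2 = rnorm(n = 10000, mean = 10, sd = 2),
  c3 = rnorm(n = 10000, mean = 10, sd = 2)
)
profvis({
  myAvg_v1(t, c("c1", "c2", "c3"))
})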