Basic simulations in R

Here is a classic stats problem: you toss a coin 100 times (yes, statisticians are probably bored and have a lot of time to waste…) and observe 42 heads (and 58 tails - the coin never lands on its edge…). What is the probability that this coin is fair? (i.e. has a 50% chance of landing on either side)

Coin toss - Image credit: https://en.wikipedia.org/wiki/File:Coin_Toss_(3635981474).jpg

.

One way to answer this question is to perform 100 coin toss with a fair coin and record the number of times that the coin landed on “heads”. Repeat the experiment 1,000 times. If you never observed 42 or fewer heads in any of the 1,000 experiments, you can conclude that your initial observation had a less than 1 in a thousand chance of happening with a fair coin. In other words, the p-value associated to the rejection of our null hypothesis (the null hypothesis being: the coin is fair) is less than 1/1,000. If you observed 42 or fewer heads in 100 experiments out of 1000, your p-value would be 0.1.

Rather than tossing a coin 100x1,000 times (that would take you ~28 hours, assuming you take one second to toss the coin and record the result…) we can write some code in R to perform this type of simulation and obtain the result in a fraction of a second (+ the few minutes it will take to write the code the first time).

Random drawings in R.

There are a few ways to generate random numbers or randomly re-order elements in R. Here are the basics.
Let’s start by preparing a vector of integer, from 1 to 10, in order:

v = seq(from=1, to=10, by=1)
v
##  [1]  1  2  3  4  5  6  7  8  9 10

We can randomly change the order of elements inside the vector with the function sample:

sample(v)
##  [1]  2  7 10  5  3  1  8  9  4  6

Try to call the function sample a few times to see how you’ll obtain a different result (almost) each time.

Note that the function sample does not actually change the order of elements inside the vector v but it returns a randomized version of the vector. You could save the result in a seperate vector:

vs = sample(v)
vs
##  [1]  4  1  9  5  7  3  8  2 10  6

One way to randomly pick one value out of 10 inside the vector would be:

vs = sample(v)
randomValue = vs[1]
randomValue
## [1] 1

Note that this works also with non-numerical variables:

v = c("A", "B", "C", "D", "E", "F")
sample(v)
## [1] "C" "B" "F" "E" "D" "A"

The function sample can do even more. If you want to randomly draw 2 values from a vector you can simply use sample this way:

# With "replace=F" --> drawing without replacement
sample(v, size=2, replace=F)
## [1] "B" "D"
# Same thing, with replacement:
sample(v, size=2, replace=T)
## [1] "E" "A"
# 20 random drawings:
sample(v, size=20, replace=T)
##  [1] "F" "B" "D" "C" "D" "D" "C" "E" "F" "A" "D" "E" "A" "A" "D" "F" "C" "C" "A"
## [20] "D"

Application to the coin flipping problem.

Let’s start by making a “coin” object. In this case, it will be a vector with two values: “head” and “tail”. We could have coded it with “H” and “T” or even 0 and 1 (or T and F).

coin = c("head", "tail")

# Let's flip the coin 10 times:
sample(coin, size=10, replace=T)
##  [1] "head" "tail" "head" "head" "head" "head" "head" "tail" "tail" "head"

We need to repeat this process many times and record the result (=how many “head”) at each simulation. Below is the code to test the probability of obtaining 42 heads out of 100 coin toss. We will simulate 100 coin toss 10,000 times. If your laptop is slow, you can reduce this number to 1,000 to decrease the load on your processor.

NB_TOSS = 100
NB_HEAD = 42

NB_SIMULATIONS = 10000

vRes = rep(NA, NB_SIMULATIONS) # <- Vector to keep track of the results after each simulation

for(i in (1:NB_SIMULATIONS)){
  vToss = sample(coin, size=NB_TOSS, replace=T)
  nbHead = length(which(vToss=="head"))
  vRes[i] = nbHead
}

We now have, inside the vRes vector a record of how many heads (out of 100 toss) were obtained at each round of simulation.

vRes[1:20]
##  [1] 53 50 50 51 55 45 53 52 53 49 50 54 57 44 57 55 59 49 59 54

We can use a histogram plot to visualize the distribution of results:

hist(vRes, nclass=50, col="steelblue", main="Distribution of #heads per simulation (of 100 toss each)", xlab="#heads")

Let’s add the location of our test number:

hist(vRes, nclass=50, col="steelblue", main="Distribution of #heads per simulation (of 100 toss each)", xlab="#heads")
abline(v=NB_HEAD, col="red")

Computing a p-value.

So, what’s the probability of obtaining 42 or fewer heads out of 100 coin toss? In other words, if you observe 42 heads out of 100 coin toss, what is the probability of being wrong if you reject the null hypothesis (null hypothesis = the coin is fair)?

This is simply the fraction of simulations that resulted in 42 or fewer heads:

nbSuccess = length(which(vRes<=NB_HEAD))
pval = nbSuccess / NB_SIMULATIONS
cat("The probability of obtaining ", NB_HEAD, " head out of ", NB_TOSS, " coin toss is: ", round(pval, 3), "\n", sep="")
## The probability of obtaining 42 head out of 100 coin toss is: 0.065

The shortcut…

Because tossing a coin is a simple binomial process, you can use the suit of pbinom, rbinom, … functions in R to directly compute the probability of obtaining 42 out of 100 coin toss:

pbinom(NB_HEAD, NB_TOSS, prob=0.5, lower.tail=T)
## [1] 0.06660531

Exercise.

Part 1: One more coin flipping problem…

The goal of this exercise is to write some R code to answer the following question: how far from the expected 50/50 ratio does a coin need to deviate to reject the null hypothesis at a given alpha level and as a function of the number of coin flips made.
For example: if I flip a coin 100 times, what is the minimum deviation from the expected 50/50 (head/tail) for which I obtain a p-value < alpha.
Let’s set alpha to 0.01. You can try a few values “by hand”:

pbinom(42, size=100, prob=0.5, lower.tail=T)
## [1] 0.06660531
pbinom(41, size=100, prob=0.5, lower.tail=T)
## [1] 0.04431304
pbinom(40, size=100, prob=0.5, lower.tail=T)
## [1] 0.02844397
pbinom(39, size=100, prob=0.5, lower.tail=T)
## [1] 0.0176001
pbinom(38, size=100, prob=0.5, lower.tail=T)
## [1] 0.01048937
pbinom(37, size=100, prob=0.5, lower.tail=T)
## [1] 0.006016488

In this example, you need to obtain 37 (or fewer) heads (out of 100 flips) to pass the p-value threshold of 0.01. So, the minimum deviation from 50/50 would be:

(50-37)/100
## [1] 0.13

For this exercise, you will need to write a function which finds this minimum deviation for a given number of flips and a given alpha value. Once you have this function working, use it to compute the minimum deviation for coin flips going from 10 to 1,000 (for alpha = 0.01) and plot the results in a simple scatter plot. The result should look like this:

Make sure to write your code in a way that allows you to easily run it with a different set of values (= different values for alpha and min/max number of coin flips). To help you complete this assignment, here is a suggestion for the overall structure of the program (but feel free to try a different approach if you feel comfortable doing so):

getMinFrac <- function(nbFlips, alpha){

  # Start by computing the pvalue obtained with just 1 success away from a perfect 50/50
  # For example, if nbFlips is set to 100, you would compute the pvalue for 49 heads.
  
  # While this p-value is larger than the alpha threshold, keep trying after decreasing the number of success
  # Stop when the number of success is below zero or if your p-value is below the alpha threshold
  
  # If you have reached the test for zero success and the pvalue is still higher than the threshold, return NA
  
  # Otherwise, compute the "distance" from 0.5 at which you are when you crossed the alpha thresold and return this value
}

# Now, you can call the function "getMinFrac" on every value between 10 and 1,000. All what is left to do is record the result for each call of the function, match it to the corresponding "nbFlips" and plot the results.

Part 2: Application to genomics

Use a simple simulation in R to solve the following problem:

Dr. Leavitt is investigating the transcriptomic response of yeast to changes in environmental conditions. She finds that, out of 6,200 protein-coding genes, 311 genes are up-regulated during heat-shock and 234 genes are up-regulated upon cold-shock.
21 genes appear in both dataset (heat-shock and cold-shock). Is this more or fewer than expected just by chance (= if the transcriptomic response to heat and cold shocks are unlinked)?

You will probably need to use the function is.element which was presented in earlier labs…

Make sure to include in your answer a distribution of the expected number of overlapping genes. Below is what this distribution should look like: