Gene expression in budding yeast responding to alcohols with and without oxygen

The goal of this lab is to analyze transcriptomic data from a previously published dataset. The data used in this lab comes from a paper published in 2018: Genotype-by-Environment-by-Environment Interactions in the Saccharomyces cerevisiae Transcriptomic Response to Alcohols and Anaerobiosis (https://pubmed.ncbi.nlm.nih.gov/30301737/)

The full dataset is available on NCBI GEO/SRA (two linked databases that contain Gene Expression/Short Read sequences data) at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118069

The data is made of RNAseq reads from several yeast strains (IL01, NCY3290 and Y7568), exposed to different agents (Butanol, Ethanol, Isobutanol, YPD) and grown under aerobic or anaerobic condition.

While the analyzes done in this paper are fairly complex (due to the multi-factorial design of the experiment), we will focus on differences in gene expression between aerobic and anaerobic growth conditions and compare the results obtained with those of the 2006 study presented in class (https://pubmed.ncbi.nlm.nih.gov/17322208/).

Initial quantification step

Many RNAseq pipelines use the tool tophat2 to map reads to a reference genome and then cufflinks to estimate the abundance (RPKM or FPKM) of each gene. Here, we will use the new generation of tools that use pseudoalignment. Pseudoalignement relies on kmers to find the most likely transcript origin for each sequencing read (see: https://pachterlab.github.io/kallisto/about and https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html). These methods are a LOT faster than traditional alignment methods.
The program used here to perform this step is Kallisto (https://pachterlab.github.io/kallisto/manual). All the sequencing files are pre-downloaded and available in: /mnt/scratch/goutfs/CMB8013/RNAseq/SRP156315/

Our first step will be to run the Kallisto pseudomapping program for each of these files. Kallisto requires a first step of indexing the sequences against which RNAseq reads will be pseduomapped. The index is already available at: /mnt/scratch/goutfs/CMB8013/RNAseq/index/transcripts.idx. It was obtained with:

cd /mnt/scratch/goutfs/CMB8013/RNAseq/index
module load Kallisto
kallisto index -i transcripts.idx Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz

Exercise: write a bash script to automatically run Kallisto on each sequencing file.

The file Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz contains all the predicted transcripts for the yeast genome and was downloaded from the ensembl website.
The pseudomapping/quantification step is performed with the following general command (for unpaired reads of size 300 with sd 50):

kallisto quant -i [indexFile] -o [outputFolder] --single -l 300 -s 50 [reads.fastq.gz]

It would be quite cumbersome to type this command 48 times (for the 48 files that need to be processed). Therefore, we will write a bash script to automatize this process. The general structure of the script should be as follows:

#!/bin/bash

For each of the SRR*.fastq files
  Extract the SRR number from the file name --> to be used for the output folder name
  Prepare file names for slurm, log and err (for example: slurm_SRR7642939.sh, SRR7642939.log and SRR7642939.err)
  Write the corresponding slurm command into the slurm file
  submit the slurm file to the queue
done

For the second part of this lab, you will need to have the files generated by Kallisto. If you ran into some issues with this first part, you can download the corresponding files (solution) from: /mnt/scratch/goutfs/CMB8013/RNAseq/Abundances
You will also need to use a brief summary of the experiment, which is available in the text-tabulated file: /mnt/scratch/goutfs/CMB8013/RNAseq/experiment.tab

You can download these files to your computer and work locally or perform all the steps on Lugh.

Analysis in R

This roadmap is following the steps given in the official DESeq2 tutorial available here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Preparing the data:

Look at the content of the experiment.tab file:

txp = read.table("experiment.tab", h=T, sep="\t")
Run agent BioProject BioSample SampleName Strain oxy
SRR7642929 YPD PRJNA484406 SAMN09764590 GSM3318008 Y7568 AER
SRR7642930 YPD PRJNA484406 SAMN09764636 GSM3318009 Y7568 AER
SRR7642931 YPD PRJNA484406 SAMN09764635 GSM3318010 NCY3290 ANAE
SRR7642932 YPD PRJNA484406 SAMN09764634 GSM3318011 NCY3290 ANAE
SRR7642933 YPD PRJNA484406 SAMN09764633 GSM3318012 NCY3290 AER
SRR7642934 Isobutanol PRJNA484406 SAMN09764632 GSM3318013 NCY3290 ANAE
SRR7642935 Isobutanol PRJNA484406 SAMN09764631 GSM3318014 NCY3290 ANAE
SRR7642936 YPD PRJNA484406 SAMN09764629 GSM3318015 NCY3290 AER
SRR7642937 YPD PRJNA484406 SAMN09764628 GSM3318016 IL01 AER
SRR7642938 Butanol PRJNA484406 SAMN09764627 GSM3318017 Y7568 AER
SRR7642939 Butanol PRJNA484406 SAMN09764626 GSM3318018 Y7568 AER
SRR7642940 Isobutanol PRJNA484406 SAMN09764625 GSM3318019 IL01 ANAE
SRR7642941 Isobutanol PRJNA484406 SAMN09764630 GSM3318020 IL01 ANAE
SRR7642942 Butanol PRJNA484406 SAMN09764623 GSM3318021 NCY3290 ANAE
SRR7642943 Butanol PRJNA484406 SAMN09764589 GSM3318022 NCY3290 ANAE
SRR7642944 Butanol PRJNA484406 SAMN09764619 GSM3318023 IL01 AER
SRR7642945 Butanol PRJNA484406 SAMN09764618 GSM3318024 IL01 AER
SRR7642946 YPD PRJNA484406 SAMN09764617 GSM3318025 Y7568 ANAE
SRR7642947 YPD PRJNA484406 SAMN09764624 GSM3318026 Y7568 ANAE
SRR7642948 Ethanol PRJNA484406 SAMN09764616 GSM3318027 NCY3290 AER
SRR7642949 Ethanol PRJNA484406 SAMN09764615 GSM3318028 NCY3290 AER
SRR7642950 YPD PRJNA484406 SAMN09764614 GSM3318029 IL01 ANAE
SRR7642951 YPD PRJNA484406 SAMN09764613 GSM3318030 IL01 ANAE
SRR7642952 Isobutanol PRJNA484406 SAMN09764612 GSM3318031 IL01 AER
SRR7642953 Isobutanol PRJNA484406 SAMN09764611 GSM3318032 Y7568 AER
SRR7642954 Isobutanol PRJNA484406 SAMN09764610 GSM3318033 Y7568 AER
SRR7642955 Butanol PRJNA484406 SAMN09764609 GSM3318034 NCY3290 AER
SRR7642956 Butanol PRJNA484406 SAMN09764608 GSM3318035 NCY3290 AER
SRR7642957 Ethanol PRJNA484406 SAMN09764607 GSM3318036 Y7568 AER
SRR7642958 Ethanol PRJNA484406 SAMN09764606 GSM3318037 Y7568 AER
SRR7642959 Isobutanol PRJNA484406 SAMN09764605 GSM3318038 NCY3290 AER
SRR7642960 Ethanol PRJNA484406 SAMN09764604 GSM3318039 Y7568 ANAE
SRR7642961 Ethanol PRJNA484406 SAMN09764603 GSM3318040 Y7568 ANAE
SRR7642962 Isobutanol PRJNA484406 SAMN09764602 GSM3318041 NCY3290 AER
SRR7642963 Ethanol PRJNA484406 SAMN09764601 GSM3318042 IL01 AER
SRR7642964 Ethanol PRJNA484406 SAMN09764600 GSM3318043 IL01 AER
SRR7642965 Isobutanol PRJNA484406 SAMN09764599 GSM3318044 IL01 AER
SRR7642966 Isobutanol PRJNA484406 SAMN09764598 GSM3318045 Y7568 ANAE
SRR7642967 Isobutanol PRJNA484406 SAMN09764597 GSM3318046 Y7568 ANAE
SRR7642968 Ethanol PRJNA484406 SAMN09764596 GSM3318047 NCY3290 ANAE
SRR7642969 Ethanol PRJNA484406 SAMN09764595 GSM3318048 NCY3290 ANAE
SRR7642970 Butanol PRJNA484406 SAMN09764594 GSM3318049 Y7568 ANAE
SRR7642971 Ethanol PRJNA484406 SAMN09764593 GSM3318050 IL01 ANAE
SRR7642972 Ethanol PRJNA484406 SAMN09764592 GSM3318051 IL01 ANAE
SRR7642973 Butanol PRJNA484406 SAMN09764591 GSM3318052 Y7568 ANAE
SRR7642974 YPD PRJNA484406 SAMN09764622 GSM3318053 IL01 AER
SRR7642975 Butanol PRJNA484406 SAMN09764621 GSM3318054 IL01 ANAE
SRR7642976 Butanol PRJNA484406 SAMN09764620 GSM3318055 IL01 ANAE

In this experiment, multiple variables change from one sample to the other: the agent (YPD, Isobutanol, Butanol or Ethanol), the strain and the growth condition (aerobic vs anaerobic, column “oxy”).

Let’s add a column with the location of the abundance file (generated by Kallisto) for each sample. On my computer, I have all these files in a folder named “Abundances”. The file names are [Run].tsv

txp$file = paste("Abundances/", txp$Run, ".tsv", sep="")

kable(txp[1:4,])
Run agent BioProject BioSample SampleName Strain oxy file
SRR7642929 YPD PRJNA484406 SAMN09764590 GSM3318008 Y7568 AER Abundances/SRR7642929.tsv
SRR7642930 YPD PRJNA484406 SAMN09764636 GSM3318009 Y7568 AER Abundances/SRR7642930.tsv
SRR7642931 YPD PRJNA484406 SAMN09764635 GSM3318010 NCY3290 ANAE Abundances/SRR7642931.tsv
SRR7642932 YPD PRJNA484406 SAMN09764634 GSM3318011 NCY3290 ANAE Abundances/SRR7642932.tsv

Let’s load all the packages needed for the analysis. These packages will allow us to query the ensembl database online and obtain the list of genes/transcripts from a given organism (in this case: the yeast Saccharomyces cerevisiae).

library("tximport")
library("DESeq2")
library("ggplot2")

library("ensembldb")
library(AnnotationHub)

Obtaining gene/transcript identifiers:

################################################################################
# We start by retrieving information from ensembl about gene/transcripts names:
# Our goal here is to create a two columns table which will link the transcripts 
# to their respective genes.

ah <- AnnotationHub()
#ahDb <- query(ah, pattern = c("Saccharomyces cerevisiae", "EnsDb", 103))
#ahDb
edb <- ah[["AH89534"]]

txs <- transcripts(edb, return.type = "DataFrame")
# Unfortunately, the transcript IDs returned here have an added "_mRNA" text at the end.
# Let's remove it:
txs$tID = gsub("_mRNA", "", txs$tx_id)
tl = txs[ , c("tID", "gene_id")]
colnames(tl) = c("TXNAME", "GENEID")

# To avoid doing this step again, we can save the "tl" data.frame (which contains 
# the relationships between transcripts and genes) in a text file (and simply 
# read this file next time we want to analyze the data, with maybe different 
# parameters):

write.table(tl, file="gene_transcript_link.tab", col.names=T, row.names=F, quote=F, sep="\t")

tx2gene = tl

# Next time, you can skip all this code above and simply use:

tx2gene = read.table("gene_transcript_link.tab", h=T, sep="\t")

Importing the data in R (using TXI and DESeq2):

# Now, let's use DESeq2 to analyze the data:

# The columns containing variables of interest need to be transformed into factors (see: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#note-on-factor-levels and https://www.stat.berkeley.edu/~s133/factors.html)
txp$agent = factor(txp$agent)
txp$Strain = factor(txp$Strain)
txp$oxy = factor(txp$oxy)

# Importing all the abundance calculations performed by Kallisto:
txi <- tximport(txp$file, type="kallisto", tx2gene=tx2gene)


# Creating the DESeq dataset:
# The most important part here is the "design" formula which tells DESeq2 that we are interested in contrasting aerobic vs anaerobic (column "oxy") but that it should know that two other variables (agent and Strain) can contribute to the variance in expression level. 
ddsTxi <- DESeqDataSetFromTximport(txi, colData = txp, design = ~ agent + Strain + oxy)

# These next 3 lines will create a matrix of transcript abundances which you can use to look at values in each sample
ddsc <- estimateSizeFactors(ddsTxi)
c = counts(ddsc, normalized=TRUE)
colnames(c) = paste(txp$Run, txp$oxy, txp$agent, txp$SampleName, sep="_")

# Let's remove genes that have low expression level throughout the experiment.
# The advantage of doing so is that fewer genes in the analysis = more statistical power during the correction for multiple testing.
# The cutoff at 200 is arbitrary.
MIN_ROW_SUM = 20
keep <- which( rowSums(counts(ddsTxi)) >= MIN_ROW_SUM )
length(keep)
## [1] 6234
ddsTxi <- ddsTxi[keep,]

# How many genes were removed by this step?

# Now, we let DESeq perform its magic...
dds <- DESeq(ddsTxi)
DESeq2res = results(dds, cooksCutoff = Inf)
DESeq2res
## log2 fold change (MLE): oxy ANAE vs AER 
## Wald test p-value: oxy ANAE vs AER 
## DataFrame with 6234 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##           <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Q0010       2.28137      -0.335161 0.3522456 -0.951498 3.41352e-01 3.83629e-01
## Q0045   15258.63531      -0.574666 0.0915521 -6.276931 3.45322e-10 7.95836e-10
## Q0050   10817.95191      -0.995330 0.3083073 -3.228370 1.24498e-03 1.88059e-03
## Q0055    3424.40498       0.712579 0.8187071  0.870372 3.84097e-01 4.26973e-01
## Q0060     218.43571      -0.194925 0.3618107 -0.538750 5.90060e-01 6.29330e-01
## ...             ...            ...       ...       ...         ...         ...
## YPR200C    10.60230      -0.239324  0.265586 -0.901119  0.36752520  0.41038010
## YPR201W    58.06971       0.380461  0.239714  1.587149  0.11247891  0.13754286
## YPR202W   190.41130       0.339468  0.113332  2.995358  0.00274123  0.00402829
## YPR203W     3.13892       0.579292  0.398659  1.453101  0.14619560  0.17545618
## YPR204W   121.82286       0.665697  0.498179  1.336259  0.18146449  0.21461765

Exercise:

How many genes have a significant differential expression between aerobic and anaerobic growth (at the alpha level of 0.05) before and after correction for multiple testing?

Modify the code to select only the samples with Butanol or Isobutanol and find the genes that are deferentially expressed between these two agents. Here is the result that you should obtain:

## log2 fold change (MLE): agent Isobutanol vs Butanol 
## Wald test p-value: agent Isobutanol vs Butanol 
## DataFrame with 5834 rows and 6 columns
##          baseMean log2FoldChange     lfcSE        stat    pvalue      padj
##         <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
## Q0045   15418.863     0.04153796  0.111022  0.37414119  0.708299  0.951690
## Q0050   12676.026     0.00125003  0.159697  0.00782754  0.993755  0.999187
## Q0055    3376.244    -0.08985234  0.280942 -0.31982549  0.749101  0.958376
## Q0060     264.849     0.24323798  0.191124  1.27267240  0.203134  0.669540
## Q0065    2297.199     0.25096109  0.171878  1.46011050  0.144260  0.573934
## ...           ...            ...       ...         ...       ...       ...
## YPR199C  198.6523      0.0389282 0.0900152    0.432462 0.6654056  0.939745
## YPR200C   12.6383      0.5795473 0.2553661    2.269476 0.0232394  0.216926
## YPR201W   75.7631      0.5022001 0.2185290    2.298093 0.0215565  0.206165
## YPR202W  173.8028      0.2622094 0.1452016    1.805830 0.0709448  0.401837
## YPR204W  113.7776     -0.3492797 0.6858166   -0.509290 0.6105488  0.915446

Visualizing data

We can look at the abundances for any given gene with the following code:

# Looking at gene YDR518W
geneID = "YDR518W"
plotCounts(dds, gene=geneID, intgroup="oxy")

# We can even split the data by aero/aner + agent to see if the increased/decreased expression level is specific to a certain agent
plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"))

# Or specific to a certain strain?
plotCounts(dds, gene=geneID, intgroup=c("oxy", "Strain"))

# Let's make these plots nicer with ggplot. For this, we will capture the data returned by "plotCounts" and then use ggplot to perform the actual ploting:
d <- plotCounts(dds, gene=geneID, intgroup=c("oxy"), returnData=TRUE)
ggplot(d, aes(x=oxy, y=count) ) + geom_point(position=position_jitter(w=0.1,h=0))

d <- plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"), returnData=TRUE)
ggplot(d, aes(x=interaction(oxy,agent), y=count, color=oxy, shape=agent)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Exercise:

  1. The 2006 study found that only 5 out of 27 genes tested confer an advantage for anaerobic growth. Are these 5 genes significantly more highly expressed in anaerobic vs aerobic condition in this new experiment? The list of genes is: eug1, izh2, plb2, ylr413w and yor012w. You can convert the gene name into its corrsponding systematic ID by searching for it on: https://www.yeastgenome.org/

For example, the ID for eug1 in this experiment is: YDR518W

  1. The main result from the 2006 study was that many genes with increased expression level under anaerobic conditions did not contribute to the fitness/survival of yeast when grown anaerobically. One possible explanation for this surprising result is that some of the genes in the list used in 2006 are not really over-expressed under anaerobic growth (false positives). One such gene is muc1. Does the new data confirm that muc1 is over-expressed under anaerobic growth? How about the gene ysr3?

Data quality control and clustering

This step can be extremely useful to spot problems with the data. For example, if a sample was incorrectly labeled or switched (these things happen more frequently than they should in the real world…). Here, we will perform a clustering of the samples to detect any problem.

library("pheatmap")
library("vsn")

# Normalizing the data
blind = F
vsd <- vst(dds, blind=blind)

# We will use the top 100 genes with the highest expression level to perform the clustering:
nbTopGenes = 100
select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:nbTopGenes]
df <- as.data.frame(colData(dds)[,c("oxy","agent", "Strain")])
ph <- pheatmap(assay(vsd)[select,], cluster_rows=T, show_rownames=F, cluster_cols=T, annotation_col=df)
ph

In this heatmap, each row represent the expression pattern of a gene across all the samples (samples are in columns). Which variable (Strain, agent or oxy) has the most clustering/predictive power? Do you see any odd sample?

Gene ontology analysis.

Genes can be annotated based on their function. A massive effort to annotate the function of genes is the Gene Ontology project: http://geneontology.org/

For example, the gene eug1 (one of the 5 genes from the 2006 study) belongs to the functional category “protein disulfide isomerase activity” (GO:0006467), which is itself a sub-category of the more general “protein folding” category. See: http://amigo.geneontology.org/amigo/gene_product/SGD:S000002926

Gene ontology categories can be of three types: Molecular Function (MF), Biological Process (BP), or Cellular Component (CC). Our eug1 gene is also a member of GO:0005783 (endoplasmic reticulum) in the Cellular Component ontology.

Note that not all genes are assigned a functional category. For many genes, the function that they perform is still unknown.

It is often desirable to find if certain functional categories are over-represented among genes deferentially expressed in an experiment. For example here, we could be interested in finding which types of genes tend to be diferentially expressed in response to anaerobic growth. This type of analysis is called a GO enrichement analysis. Here is a short road map about how to perform this analysis:

library("topGO")
library("org.Sc.sgd.db")


# The main information needed for the GO enrichement analysis is a list of values summarizing the experiment for each gene (what is called the "gene universe")
# The value can be anything, but what is typically used is the p-value on the differential expression test (although the log fold change is also used sometimes)
# The goal is to create a vector with as many elements as there are genes in the analysis. The values inside the vector are the p-values and the names of the elements in the vector are the gene names.
# The following short function creates this vector for us:
getGeneUniverse <- function(deseqRes){
  vgs = as.numeric(deseqRes$pvalue)
  names(vgs) = rownames(deseqRes)
  vgs
}

# The sel_pval function is to be used for some enrichment analysis. Some tests use a hard cutt-off to generate list of genes deferentially expressed. In this case, the cutoff would be a p-value of 0.05.
# When using this strategy, the GO enrichment is done as follows:
# - Let's assume that the gene universe is 10,000 genes.
# - 2,000 of these genes are deferentially expressed (with p-value cutoff of 0.05)
# - 500 of the 10,000 genes are annotated as belonging to GO:00001 (a fictional Gene Ontology category)
# - 400 of the 2,000 belong to GO:00001
# Is GO:00001 over-represented among deferentially expressed genes?
# (The null expectation would be: 500*0.2 = 100)
# This can be tested with a simple Fisher test.
# Note that more sophisticated methods (Such as Kolmogorov–Smirnov) are also available and circumvent the use of arbitrary cutoffs
sel_pval <- function(allScore){ return(allScore < 0.05)}

geneUniverse = getGeneUniverse(DESeq2res)
  
dGO <- as.data.frame(org.Sc.sgdGO)
geneID2GO <- list()
for(geneID in unique(dGO$systematic_name)){
  geneID2GO[[geneID]] <- dGO$go_id[which(dGO$systematic_name==geneID)]
}

ontology <- "BP"
go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)


# Preparing the topGO analysis for the Biological Process (BP) ontology:
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,  
             allGenes = geneUniverse, geneSel = sel_pval, 
             nodeSize = 10, 
             annot = annFUN.GO2genes,
             GO2genes=go2genes)

allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "ks", algorithm = "elim")

allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))

kable(allRes[1:20,])
GO.ID Term Annotated Significant Expected pval
GO:0002181 cytoplasmic translation 180 174 144.11 < 1e-30
GO:0000462 maturation of SSU-rRNA from tricistronic… 104 101 83.27 1.3e-13
GO:0000027 ribosomal large subunit assembly 43 43 34.43 2.5e-10
GO:0000463 maturation of LSU-rRNA from tricistronic… 52 48 41.63 3.5e-10
GO:0042273 ribosomal large subunit biogenesis 127 122 101.68 1.6e-08
GO:0000447 endonucleolytic cleavage in ITS1 to sepa… 44 43 35.23 4.4e-07
GO:0000480 endonucleolytic cleavage in 5’-ETS of tr… 33 32 26.42 5.2e-07
GO:0000472 endonucleolytic cleavage to generate mat… 34 33 27.22 8.9e-07
GO:0000466 maturation of 5.8S rRNA from tricistroni… 91 86 72.86 9.5e-07
GO:0006407 rRNA export from nucleus 16 16 12.81 1.9e-06
GO:0015986 proton motive force-driven ATP synthesis 18 16 14.41 2.2e-05
GO:0006364 rRNA processing 280 264 224.18 4.4e-05
GO:0000028 ribosomal small subunit assembly 26 24 20.82 4.6e-05
GO:0030488 tRNA methylation 24 22 19.22 5.5e-05
GO:0120009 intermembrane lipid transfer 21 20 16.81 5.9e-05
GO:0006450 regulation of translational fidelity 31 28 24.82 8.6e-05
GO:0048278 vesicle docking 34 32 27.22 0.00013
GO:0034727 piecemeal microautophagy of the nucleus 41 40 32.83 0.00018
GO:0001732 formation of cytoplasmic translation ini… 10 10 8.01 0.00020
GO:0043328 protein transport to vacuole involved in… 28 24 22.42 0.00022

Question:

Based on the name/description of these functional categories, can you identify some that are clearly related to aerobic vs anaerobic growth? Is the functional category cellular respiration significantly over/under represented among diferentially expressed genes?

Now looking at Cellular Components:

ontology <- "CC"

go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,  
             allGenes = geneUniverse, geneSel = sel_pval, 
             nodeSize = 10, 
             annot = annFUN.GO2genes,
             GO2genes=go2genes)

allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "KS", algorithm = "elim")

allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))

kable(allRes[1:10,])
GO.ID Term Annotated Significant Expected pval
GO:0030686 90S preribosome 90 87 72.12 < 1e-30
GO:0022625 cytosolic large ribosomal subunit 80 78 64.11 < 1e-30
GO:0005730 nucleolus 289 262 231.59 1.0e-15
GO:0022627 cytosolic small ribosomal subunit 59 55 47.28 1.9e-14
GO:0032040 small-subunit processome 53 53 42.47 3.5e-12
GO:0030687 preribosome, large subunit precursor 62 60 49.68 8.5e-11
GO:0000839 Hrd1p ubiquitin ligase ERAD-L complex 12 12 9.62 8.6e-07
GO:0030688 preribosome, small subunit precursor 23 23 18.43 2.3e-06
GO:0005737 cytoplasm 4349 3546 3485.11 3.0e-05
GO:0005736 RNA polymerase I complex 12 12 9.62 3.4e-05