The goal of this lab is to analyze transcriptomic data from a previously published dataset. The data used in this lab comes from a paper published in 2018: Genotype-by-Environment-by-Environment Interactions in the Saccharomyces cerevisiae Transcriptomic Response to Alcohols and Anaerobiosis (https://pubmed.ncbi.nlm.nih.gov/30301737/)
The full dataset is available on NCBI GEO/SRA (two linked databases that contain Gene Expression/Short Read sequences data) at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118069
The data is made of RNAseq reads from several yeast strains (IL01, NCY3290 and Y7568), exposed to different agents (Butanol, Ethanol, Isobutanol, YPD) and grown under aerobic or anaerobic condition.
While the analyzes done in this paper are fairly complex (due to the multi-factorial design of the experiment), we will focus on differences in gene expression between aerobic and anaerobic growth conditions and compare the results obtained with those of the 2006 study presented in class (https://pubmed.ncbi.nlm.nih.gov/17322208/).
Many RNAseq pipelines use the tool tophat2 to map
reads to a reference genome and then cufflinks to
estimate the abundance (RPKM or FPKM) of each gene. Here, we will use
the new generation of tools that use pseudoalignment.
Pseudoalignement relies on kmers to find the most likely transcript
origin for each sequencing read (see: https://pachterlab.github.io/kallisto/about and https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html).
These methods are a LOT faster than traditional alignment methods.
The program used here to perform this step is Kallisto
(https://pachterlab.github.io/kallisto/manual). All the
sequencing files are pre-downloaded and available in:
/mnt/scratch/goutfs/CMB8013/RNAseq/SRP156315/
Our first step will be to run the Kallisto pseudomapping program for each of these files. Kallisto requires a first step of indexing the sequences against which RNAseq reads will be pseduomapped. The index is already available at: /mnt/scratch/goutfs/CMB8013/RNAseq/index/transcripts.idx. It was obtained with:
cd /mnt/scratch/goutfs/CMB8013/RNAseq/index
module load Kallisto
kallisto index -i transcripts.idx Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
The file
Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
contains all the predicted transcripts for the yeast genome and was
downloaded from the ensembl website.
The
pseudomapping/quantification step is performed with the following
general command (for unpaired reads of size 300 with sd 50):
kallisto quant -i [indexFile] -o [outputFolder] --single -l 300 -s 50 [reads.fastq.gz]
It would be quite cumbersome to type this command 48 times (for the 48 files that need to be processed). Therefore, we will write a bash script to automatize this process. The general structure of the script should be as follows:
#!/bin/bash
For each of the SRR*.fastq files
Extract the SRR number from the file name --> to be used for the output folder name
Prepare file names for slurm, log and err (for example: slurm_SRR7642939.sh, SRR7642939.log and SRR7642939.err)
Write the corresponding slurm command into the slurm file
submit the slurm file to the queue
done
For the second part of this lab, you will need to have the files
generated by Kallisto. If you ran into some issues with this first part,
you can download the corresponding files (solution) from:
/mnt/scratch/goutfs/CMB8013/RNAseq/Abundances
You
will also need to use a brief summary of the experiment, which is
available in the text-tabulated file:
/mnt/scratch/goutfs/CMB8013/RNAseq/experiment.tab
You can download these files to your computer and work locally or perform all the steps on Lugh.
This roadmap is following the steps given in the official DESeq2 tutorial available here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Look at the content of the experiment.tab file:
txp = read.table("experiment.tab", h=T, sep="\t")
Run | agent | BioProject | BioSample | SampleName | Strain | oxy |
---|---|---|---|---|---|---|
SRR7642929 | YPD | PRJNA484406 | SAMN09764590 | GSM3318008 | Y7568 | AER |
SRR7642930 | YPD | PRJNA484406 | SAMN09764636 | GSM3318009 | Y7568 | AER |
SRR7642931 | YPD | PRJNA484406 | SAMN09764635 | GSM3318010 | NCY3290 | ANAE |
SRR7642932 | YPD | PRJNA484406 | SAMN09764634 | GSM3318011 | NCY3290 | ANAE |
SRR7642933 | YPD | PRJNA484406 | SAMN09764633 | GSM3318012 | NCY3290 | AER |
SRR7642934 | Isobutanol | PRJNA484406 | SAMN09764632 | GSM3318013 | NCY3290 | ANAE |
SRR7642935 | Isobutanol | PRJNA484406 | SAMN09764631 | GSM3318014 | NCY3290 | ANAE |
SRR7642936 | YPD | PRJNA484406 | SAMN09764629 | GSM3318015 | NCY3290 | AER |
SRR7642937 | YPD | PRJNA484406 | SAMN09764628 | GSM3318016 | IL01 | AER |
SRR7642938 | Butanol | PRJNA484406 | SAMN09764627 | GSM3318017 | Y7568 | AER |
SRR7642939 | Butanol | PRJNA484406 | SAMN09764626 | GSM3318018 | Y7568 | AER |
SRR7642940 | Isobutanol | PRJNA484406 | SAMN09764625 | GSM3318019 | IL01 | ANAE |
SRR7642941 | Isobutanol | PRJNA484406 | SAMN09764630 | GSM3318020 | IL01 | ANAE |
SRR7642942 | Butanol | PRJNA484406 | SAMN09764623 | GSM3318021 | NCY3290 | ANAE |
SRR7642943 | Butanol | PRJNA484406 | SAMN09764589 | GSM3318022 | NCY3290 | ANAE |
SRR7642944 | Butanol | PRJNA484406 | SAMN09764619 | GSM3318023 | IL01 | AER |
SRR7642945 | Butanol | PRJNA484406 | SAMN09764618 | GSM3318024 | IL01 | AER |
SRR7642946 | YPD | PRJNA484406 | SAMN09764617 | GSM3318025 | Y7568 | ANAE |
SRR7642947 | YPD | PRJNA484406 | SAMN09764624 | GSM3318026 | Y7568 | ANAE |
SRR7642948 | Ethanol | PRJNA484406 | SAMN09764616 | GSM3318027 | NCY3290 | AER |
SRR7642949 | Ethanol | PRJNA484406 | SAMN09764615 | GSM3318028 | NCY3290 | AER |
SRR7642950 | YPD | PRJNA484406 | SAMN09764614 | GSM3318029 | IL01 | ANAE |
SRR7642951 | YPD | PRJNA484406 | SAMN09764613 | GSM3318030 | IL01 | ANAE |
SRR7642952 | Isobutanol | PRJNA484406 | SAMN09764612 | GSM3318031 | IL01 | AER |
SRR7642953 | Isobutanol | PRJNA484406 | SAMN09764611 | GSM3318032 | Y7568 | AER |
SRR7642954 | Isobutanol | PRJNA484406 | SAMN09764610 | GSM3318033 | Y7568 | AER |
SRR7642955 | Butanol | PRJNA484406 | SAMN09764609 | GSM3318034 | NCY3290 | AER |
SRR7642956 | Butanol | PRJNA484406 | SAMN09764608 | GSM3318035 | NCY3290 | AER |
SRR7642957 | Ethanol | PRJNA484406 | SAMN09764607 | GSM3318036 | Y7568 | AER |
SRR7642958 | Ethanol | PRJNA484406 | SAMN09764606 | GSM3318037 | Y7568 | AER |
SRR7642959 | Isobutanol | PRJNA484406 | SAMN09764605 | GSM3318038 | NCY3290 | AER |
SRR7642960 | Ethanol | PRJNA484406 | SAMN09764604 | GSM3318039 | Y7568 | ANAE |
SRR7642961 | Ethanol | PRJNA484406 | SAMN09764603 | GSM3318040 | Y7568 | ANAE |
SRR7642962 | Isobutanol | PRJNA484406 | SAMN09764602 | GSM3318041 | NCY3290 | AER |
SRR7642963 | Ethanol | PRJNA484406 | SAMN09764601 | GSM3318042 | IL01 | AER |
SRR7642964 | Ethanol | PRJNA484406 | SAMN09764600 | GSM3318043 | IL01 | AER |
SRR7642965 | Isobutanol | PRJNA484406 | SAMN09764599 | GSM3318044 | IL01 | AER |
SRR7642966 | Isobutanol | PRJNA484406 | SAMN09764598 | GSM3318045 | Y7568 | ANAE |
SRR7642967 | Isobutanol | PRJNA484406 | SAMN09764597 | GSM3318046 | Y7568 | ANAE |
SRR7642968 | Ethanol | PRJNA484406 | SAMN09764596 | GSM3318047 | NCY3290 | ANAE |
SRR7642969 | Ethanol | PRJNA484406 | SAMN09764595 | GSM3318048 | NCY3290 | ANAE |
SRR7642970 | Butanol | PRJNA484406 | SAMN09764594 | GSM3318049 | Y7568 | ANAE |
SRR7642971 | Ethanol | PRJNA484406 | SAMN09764593 | GSM3318050 | IL01 | ANAE |
SRR7642972 | Ethanol | PRJNA484406 | SAMN09764592 | GSM3318051 | IL01 | ANAE |
SRR7642973 | Butanol | PRJNA484406 | SAMN09764591 | GSM3318052 | Y7568 | ANAE |
SRR7642974 | YPD | PRJNA484406 | SAMN09764622 | GSM3318053 | IL01 | AER |
SRR7642975 | Butanol | PRJNA484406 | SAMN09764621 | GSM3318054 | IL01 | ANAE |
SRR7642976 | Butanol | PRJNA484406 | SAMN09764620 | GSM3318055 | IL01 | ANAE |
In this experiment, multiple variables change from one sample to the other: the agent (YPD, Isobutanol, Butanol or Ethanol), the strain and the growth condition (aerobic vs anaerobic, column “oxy”).
Let’s add a column with the location of the abundance file (generated by Kallisto) for each sample. On my computer, I have all these files in a folder named “Abundances”. The file names are [Run].tsv
txp$file = paste("Abundances/", txp$Run, ".tsv", sep="")
kable(txp[1:4,])
Run | agent | BioProject | BioSample | SampleName | Strain | oxy | file |
---|---|---|---|---|---|---|---|
SRR7642929 | YPD | PRJNA484406 | SAMN09764590 | GSM3318008 | Y7568 | AER | Abundances/SRR7642929.tsv |
SRR7642930 | YPD | PRJNA484406 | SAMN09764636 | GSM3318009 | Y7568 | AER | Abundances/SRR7642930.tsv |
SRR7642931 | YPD | PRJNA484406 | SAMN09764635 | GSM3318010 | NCY3290 | ANAE | Abundances/SRR7642931.tsv |
SRR7642932 | YPD | PRJNA484406 | SAMN09764634 | GSM3318011 | NCY3290 | ANAE | Abundances/SRR7642932.tsv |
Let’s load all the packages needed for the analysis. These packages will allow us to query the ensembl database online and obtain the list of genes/transcripts from a given organism (in this case: the yeast Saccharomyces cerevisiae).
library("tximport")
library("DESeq2")
library("ggplot2")
library("ensembldb")
library(AnnotationHub)
################################################################################
# We start by retrieving information from ensembl about gene/transcripts names:
# Our goal here is to create a two columns table which will link the transcripts
# to their respective genes.
ah <- AnnotationHub()
#ahDb <- query(ah, pattern = c("Saccharomyces cerevisiae", "EnsDb", 103))
#ahDb
edb <- ah[["AH89534"]]
txs <- transcripts(edb, return.type = "DataFrame")
# Unfortunately, the transcript IDs returned here have an added "_mRNA" text at the end.
# Let's remove it:
txs$tID = gsub("_mRNA", "", txs$tx_id)
tl = txs[ , c("tID", "gene_id")]
colnames(tl) = c("TXNAME", "GENEID")
# To avoid doing this step again, we can save the "tl" data.frame (which contains
# the relationships between transcripts and genes) in a text file (and simply
# read this file next time we want to analyze the data, with maybe different
# parameters):
write.table(tl, file="gene_transcript_link.tab", col.names=T, row.names=F, quote=F, sep="\t")
tx2gene = tl
# Next time, you can skip all this code above and simply use:
tx2gene = read.table("gene_transcript_link.tab", h=T, sep="\t")
# Now, let's use DESeq2 to analyze the data:
# The columns containing variables of interest need to be transformed into factors (see: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#note-on-factor-levels and https://www.stat.berkeley.edu/~s133/factors.html)
txp$agent = factor(txp$agent)
txp$Strain = factor(txp$Strain)
txp$oxy = factor(txp$oxy)
# Importing all the abundance calculations performed by Kallisto:
txi <- tximport(txp$file, type="kallisto", tx2gene=tx2gene)
# Creating the DESeq dataset:
# The most important part here is the "design" formula which tells DESeq2 that we are interested in contrasting aerobic vs anaerobic (column "oxy") but that it should know that two other variables (agent and Strain) can contribute to the variance in expression level.
ddsTxi <- DESeqDataSetFromTximport(txi, colData = txp, design = ~ agent + Strain + oxy)
# These next 3 lines will create a matrix of transcript abundances which you can use to look at values in each sample
ddsc <- estimateSizeFactors(ddsTxi)
c = counts(ddsc, normalized=TRUE)
colnames(c) = paste(txp$Run, txp$oxy, txp$agent, txp$SampleName, sep="_")
# Let's remove genes that have low expression level throughout the experiment.
# The advantage of doing so is that fewer genes in the analysis = more statistical power during the correction for multiple testing.
# The cutoff at 200 is arbitrary.
MIN_ROW_SUM = 20
keep <- which( rowSums(counts(ddsTxi)) >= MIN_ROW_SUM )
length(keep)
## [1] 6234
ddsTxi <- ddsTxi[keep,]
# How many genes were removed by this step?
# Now, we let DESeq perform its magic...
dds <- DESeq(ddsTxi)
DESeq2res = results(dds, cooksCutoff = Inf)
DESeq2res
## log2 fold change (MLE): oxy ANAE vs AER
## Wald test p-value: oxy ANAE vs AER
## DataFrame with 6234 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## Q0010 2.28137 -0.335161 0.3522456 -0.951498 3.41352e-01 3.83629e-01
## Q0045 15258.63531 -0.574666 0.0915521 -6.276931 3.45322e-10 7.95836e-10
## Q0050 10817.95191 -0.995330 0.3083073 -3.228370 1.24498e-03 1.88059e-03
## Q0055 3424.40498 0.712579 0.8187071 0.870372 3.84097e-01 4.26973e-01
## Q0060 218.43571 -0.194925 0.3618107 -0.538750 5.90060e-01 6.29330e-01
## ... ... ... ... ... ... ...
## YPR200C 10.60230 -0.239324 0.265586 -0.901119 0.36752520 0.41038010
## YPR201W 58.06971 0.380461 0.239714 1.587149 0.11247891 0.13754286
## YPR202W 190.41130 0.339468 0.113332 2.995358 0.00274123 0.00402829
## YPR203W 3.13892 0.579292 0.398659 1.453101 0.14619560 0.17545618
## YPR204W 121.82286 0.665697 0.498179 1.336259 0.18146449 0.21461765
How many genes have a significant differential expression between aerobic and anaerobic growth (at the alpha level of 0.05) before and after correction for multiple testing?
Modify the code to select only the samples with Butanol or Isobutanol and find the genes that are deferentially expressed between these two agents. Here is the result that you should obtain:
## log2 fold change (MLE): agent Isobutanol vs Butanol
## Wald test p-value: agent Isobutanol vs Butanol
## DataFrame with 5834 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## Q0045 15418.863 0.04153796 0.111022 0.37414119 0.708299 0.951690
## Q0050 12676.026 0.00125003 0.159697 0.00782754 0.993755 0.999187
## Q0055 3376.244 -0.08985234 0.280942 -0.31982549 0.749101 0.958376
## Q0060 264.849 0.24323798 0.191124 1.27267240 0.203134 0.669540
## Q0065 2297.199 0.25096109 0.171878 1.46011050 0.144260 0.573934
## ... ... ... ... ... ... ...
## YPR199C 198.6523 0.0389282 0.0900152 0.432462 0.6654056 0.939745
## YPR200C 12.6383 0.5795473 0.2553661 2.269476 0.0232394 0.216926
## YPR201W 75.7631 0.5022001 0.2185290 2.298093 0.0215565 0.206165
## YPR202W 173.8028 0.2622094 0.1452016 1.805830 0.0709448 0.401837
## YPR204W 113.7776 -0.3492797 0.6858166 -0.509290 0.6105488 0.915446
We can look at the abundances for any given gene with the following code:
# Looking at gene YDR518W
geneID = "YDR518W"
plotCounts(dds, gene=geneID, intgroup="oxy")
# We can even split the data by aero/aner + agent to see if the increased/decreased expression level is specific to a certain agent
plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"))
# Or specific to a certain strain?
plotCounts(dds, gene=geneID, intgroup=c("oxy", "Strain"))
# Let's make these plots nicer with ggplot. For this, we will capture the data returned by "plotCounts" and then use ggplot to perform the actual ploting:
d <- plotCounts(dds, gene=geneID, intgroup=c("oxy"), returnData=TRUE)
ggplot(d, aes(x=oxy, y=count) ) + geom_point(position=position_jitter(w=0.1,h=0))
d <- plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"), returnData=TRUE)
ggplot(d, aes(x=interaction(oxy,agent), y=count, color=oxy, shape=agent)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
theme(axis.text.x=element_text(angle=45,hjust=1))
For example, the ID for eug1 in this experiment is: YDR518W
This step can be extremely useful to spot problems with the data. For example, if a sample was incorrectly labeled or switched (these things happen more frequently than they should in the real world…). Here, we will perform a clustering of the samples to detect any problem.
library("pheatmap")
library("vsn")
# Normalizing the data
blind = F
vsd <- vst(dds, blind=blind)
# We will use the top 100 genes with the highest expression level to perform the clustering:
nbTopGenes = 100
select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:nbTopGenes]
df <- as.data.frame(colData(dds)[,c("oxy","agent", "Strain")])
ph <- pheatmap(assay(vsd)[select,], cluster_rows=T, show_rownames=F, cluster_cols=T, annotation_col=df)
ph
In this heatmap, each row represent the expression pattern of a gene across all the samples (samples are in columns). Which variable (Strain, agent or oxy) has the most clustering/predictive power? Do you see any odd sample?
Genes can be annotated based on their function. A massive effort to annotate the function of genes is the Gene Ontology project: http://geneontology.org/
For example, the gene eug1 (one of the 5 genes from the 2006 study) belongs to the functional category “protein disulfide isomerase activity” (GO:0006467), which is itself a sub-category of the more general “protein folding” category. See: http://amigo.geneontology.org/amigo/gene_product/SGD:S000002926
Gene ontology categories can be of three types: Molecular Function (MF), Biological Process (BP), or Cellular Component (CC). Our eug1 gene is also a member of GO:0005783 (endoplasmic reticulum) in the Cellular Component ontology.
Note that not all genes are assigned a functional category. For many genes, the function that they perform is still unknown.
It is often desirable to find if certain functional categories are over-represented among genes deferentially expressed in an experiment. For example here, we could be interested in finding which types of genes tend to be diferentially expressed in response to anaerobic growth. This type of analysis is called a GO enrichement analysis. Here is a short road map about how to perform this analysis:
library("topGO")
library("org.Sc.sgd.db")
# The main information needed for the GO enrichement analysis is a list of values summarizing the experiment for each gene (what is called the "gene universe")
# The value can be anything, but what is typically used is the p-value on the differential expression test (although the log fold change is also used sometimes)
# The goal is to create a vector with as many elements as there are genes in the analysis. The values inside the vector are the p-values and the names of the elements in the vector are the gene names.
# The following short function creates this vector for us:
getGeneUniverse <- function(deseqRes){
vgs = as.numeric(deseqRes$pvalue)
names(vgs) = rownames(deseqRes)
vgs
}
# The sel_pval function is to be used for some enrichment analysis. Some tests use a hard cutt-off to generate list of genes deferentially expressed. In this case, the cutoff would be a p-value of 0.05.
# When using this strategy, the GO enrichment is done as follows:
# - Let's assume that the gene universe is 10,000 genes.
# - 2,000 of these genes are deferentially expressed (with p-value cutoff of 0.05)
# - 500 of the 10,000 genes are annotated as belonging to GO:00001 (a fictional Gene Ontology category)
# - 400 of the 2,000 belong to GO:00001
# Is GO:00001 over-represented among deferentially expressed genes?
# (The null expectation would be: 500*0.2 = 100)
# This can be tested with a simple Fisher test.
# Note that more sophisticated methods (Such as Kolmogorov–Smirnov) are also available and circumvent the use of arbitrary cutoffs
sel_pval <- function(allScore){ return(allScore < 0.05)}
geneUniverse = getGeneUniverse(DESeq2res)
dGO <- as.data.frame(org.Sc.sgdGO)
geneID2GO <- list()
for(geneID in unique(dGO$systematic_name)){
geneID2GO[[geneID]] <- dGO$go_id[which(dGO$systematic_name==geneID)]
}
ontology <- "BP"
go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)
# Preparing the topGO analysis for the Biological Process (BP) ontology:
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,
allGenes = geneUniverse, geneSel = sel_pval,
nodeSize = 10,
annot = annFUN.GO2genes,
GO2genes=go2genes)
allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "ks", algorithm = "elim")
allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))
kable(allRes[1:20,])
GO.ID | Term | Annotated | Significant | Expected | pval |
---|---|---|---|---|---|
GO:0002181 | cytoplasmic translation | 180 | 174 | 144.11 | < 1e-30 |
GO:0000462 | maturation of SSU-rRNA from tricistronic… | 104 | 101 | 83.27 | 1.3e-13 |
GO:0000027 | ribosomal large subunit assembly | 43 | 43 | 34.43 | 2.5e-10 |
GO:0000463 | maturation of LSU-rRNA from tricistronic… | 52 | 48 | 41.63 | 3.5e-10 |
GO:0042273 | ribosomal large subunit biogenesis | 127 | 122 | 101.68 | 1.6e-08 |
GO:0000447 | endonucleolytic cleavage in ITS1 to sepa… | 44 | 43 | 35.23 | 4.4e-07 |
GO:0000480 | endonucleolytic cleavage in 5’-ETS of tr… | 33 | 32 | 26.42 | 5.2e-07 |
GO:0000472 | endonucleolytic cleavage to generate mat… | 34 | 33 | 27.22 | 8.9e-07 |
GO:0000466 | maturation of 5.8S rRNA from tricistroni… | 91 | 86 | 72.86 | 9.5e-07 |
GO:0006407 | rRNA export from nucleus | 16 | 16 | 12.81 | 1.9e-06 |
GO:0015986 | proton motive force-driven ATP synthesis | 18 | 16 | 14.41 | 2.2e-05 |
GO:0006364 | rRNA processing | 280 | 264 | 224.18 | 4.4e-05 |
GO:0000028 | ribosomal small subunit assembly | 26 | 24 | 20.82 | 4.6e-05 |
GO:0030488 | tRNA methylation | 24 | 22 | 19.22 | 5.5e-05 |
GO:0120009 | intermembrane lipid transfer | 21 | 20 | 16.81 | 5.9e-05 |
GO:0006450 | regulation of translational fidelity | 31 | 28 | 24.82 | 8.6e-05 |
GO:0048278 | vesicle docking | 34 | 32 | 27.22 | 0.00013 |
GO:0034727 | piecemeal microautophagy of the nucleus | 41 | 40 | 32.83 | 0.00018 |
GO:0001732 | formation of cytoplasmic translation ini… | 10 | 10 | 8.01 | 0.00020 |
GO:0043328 | protein transport to vacuole involved in… | 28 | 24 | 22.42 | 0.00022 |
Based on the name/description of these functional categories, can you identify some that are clearly related to aerobic vs anaerobic growth? Is the functional category cellular respiration significantly over/under represented among diferentially expressed genes?
Now looking at Cellular Components:
ontology <- "CC"
go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,
allGenes = geneUniverse, geneSel = sel_pval,
nodeSize = 10,
annot = annFUN.GO2genes,
GO2genes=go2genes)
allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "KS", algorithm = "elim")
allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))
kable(allRes[1:10,])
GO.ID | Term | Annotated | Significant | Expected | pval |
---|---|---|---|---|---|
GO:0030686 | 90S preribosome | 90 | 87 | 72.12 | < 1e-30 |
GO:0022625 | cytosolic large ribosomal subunit | 80 | 78 | 64.11 | < 1e-30 |
GO:0005730 | nucleolus | 289 | 262 | 231.59 | 1.0e-15 |
GO:0022627 | cytosolic small ribosomal subunit | 59 | 55 | 47.28 | 1.9e-14 |
GO:0032040 | small-subunit processome | 53 | 53 | 42.47 | 3.5e-12 |
GO:0030687 | preribosome, large subunit precursor | 62 | 60 | 49.68 | 8.5e-11 |
GO:0000839 | Hrd1p ubiquitin ligase ERAD-L complex | 12 | 12 | 9.62 | 8.6e-07 |
GO:0030688 | preribosome, small subunit precursor | 23 | 23 | 18.43 | 2.3e-06 |
GO:0005737 | cytoplasm | 4349 | 3546 | 3485.11 | 3.0e-05 |
GO:0005736 | RNA polymerase I complex | 12 | 12 | 9.62 | 3.4e-05 |