Gene expression in budding yeast responding to alcohols with and without oxygen

The goal of this lab is to analyze transcriptomic data from a previously published dataset. The data used in this lab comes from a paper published in 2018: Genotype-by-Environment-by-Environment Interactions in the Saccharomyces cerevisiae Transcriptomic Response to Alcohols and Anaerobiosis (https://pubmed.ncbi.nlm.nih.gov/30301737/)

The full dataset is available on NCBI GEO/SRA (two linked databases that contain Gene Expression/Short Read sequences data) at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118069

The data is made of RNAseq reads from several yeast strains (IL01, NCY3290 and Y7568), exposed to different agents (Butanol, Ethanol, Isobutanol, YPD) and grown under aerobic or anaerobic condition.

While the analyzes done in this paper are fairly complex (due to the multi-factorial design of the experiment), we will focus on differences in gene expression between aerobic and anaerobic growth conditions and compare the results obtained with those of the 2006 study presented in class (https://pubmed.ncbi.nlm.nih.gov/17322208/).

Initial quantification step

Many RNAseq pipelines use the tool tophat2 to map reads to a reference genome and then cufflinks to estimate the abundance (RPKM or FPKM) of each gene. Here, we will use the new generation of tools that use pseudoalignment. Pseudoalignement relies on kmers to find the most likely transcript origin for each sequencing read (see: https://pachterlab.github.io/kallisto/about and https://tinyheero.github.io/2015/09/02/pseudoalignments-kallisto.html). These methods are a LOT faster than traditional alignment methods.
The program used here to perform this step is Kallisto (https://pachterlab.github.io/kallisto/manual). All the sequencing files are pre-downloaded and available in: /mnt/scratch/goutfs/CMB8013/RNAseq/SRP156315/

Our first step will be to run the Kallisto pseudomapping program for each of these files. Kallisto requires a first step of indexing the sequences against which RNAseq reads will be pseduomapped. The index is already available at: /mnt/scratch/goutfs/CMB8013/RNAseq/index/transcripts.idx. It was obtained with:

cd /mnt/scratch/goutfs/CMB8013/RNAseq/index
module load Kallisto
kallisto index -i transcripts.idx Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz

Exercise: write a bash script to automatically run Kallisto on each sequencing file.

The file Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz contains all the predicted transcripts for the yeast genome and was downloaded from the ensembl website.
The pseudomapping/quantification step is performed with the following general command (for unpaired reads of size 300 with sd 50):

kallisto quant -i [indexFile] -o [outputFolder] --single -l 300 -s 50 [reads.fastq.gz]

It would be quite cumbersome to type this command 48 times (for the 48 files that need to be processed). Therefore, we will write a bash script to automatize this process. The general structure of the script should be as follows:

#!/bin/bash

For each of the SRR*.fastq files
  Extract the SRR number from the file name --> to be used for the output folder name
  Prepare file names for slurm, log and err (for example: slurm_SRR7642939.sh, SRR7642939.log and SRR7642939.err)
  Write the corresponding slurm command into the slurm file
  submit the slurm file to the queue
done

For the second part of this lab, you will need to have the files generated by Kallisto. If you ran into some issues with this first part, you can download the corresponding files (solution) from: /mnt/scratch/goutfs/CMB8013/RNAseq/Abundances
You will also need to use a brief summary of the experiment, which is available in the text-tabulated file: /mnt/scratch/goutfs/CMB8013/RNAseq/experiment.tab

You can download these files to your computer and work locally or perform all the steps on Lugh.

Analysis in R

This roadmap is following the steps given in the official DESeq2 tutorial available here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Preparing the data:

Look at the content of the experiment.tab file:

txp = read.table("experiment.tab", h=T, sep="\t")

Run	agent	BioProject	BioSample	SampleName	Strain	oxy
SRR7642929	YPD	PRJNA484406	SAMN09764590	GSM3318008	Y7568	AER
SRR7642930	YPD	PRJNA484406	SAMN09764636	GSM3318009	Y7568	AER
SRR7642931	YPD	PRJNA484406	SAMN09764635	GSM3318010	NCY3290	ANAE
SRR7642932	YPD	PRJNA484406	SAMN09764634	GSM3318011	NCY3290	ANAE
SRR7642933	YPD	PRJNA484406	SAMN09764633	GSM3318012	NCY3290	AER
SRR7642934	Isobutanol	PRJNA484406	SAMN09764632	GSM3318013	NCY3290	ANAE
SRR7642935	Isobutanol	PRJNA484406	SAMN09764631	GSM3318014	NCY3290	ANAE
SRR7642936	YPD	PRJNA484406	SAMN09764629	GSM3318015	NCY3290	AER
SRR7642937	YPD	PRJNA484406	SAMN09764628	GSM3318016	IL01	AER
SRR7642938	Butanol	PRJNA484406	SAMN09764627	GSM3318017	Y7568	AER
SRR7642939	Butanol	PRJNA484406	SAMN09764626	GSM3318018	Y7568	AER
SRR7642940	Isobutanol	PRJNA484406	SAMN09764625	GSM3318019	IL01	ANAE
SRR7642941	Isobutanol	PRJNA484406	SAMN09764630	GSM3318020	IL01	ANAE
SRR7642942	Butanol	PRJNA484406	SAMN09764623	GSM3318021	NCY3290	ANAE
SRR7642943	Butanol	PRJNA484406	SAMN09764589	GSM3318022	NCY3290	ANAE
SRR7642944	Butanol	PRJNA484406	SAMN09764619	GSM3318023	IL01	AER
SRR7642945	Butanol	PRJNA484406	SAMN09764618	GSM3318024	IL01	AER
SRR7642946	YPD	PRJNA484406	SAMN09764617	GSM3318025	Y7568	ANAE
SRR7642947	YPD	PRJNA484406	SAMN09764624	GSM3318026	Y7568	ANAE
SRR7642948	Ethanol	PRJNA484406	SAMN09764616	GSM3318027	NCY3290	AER
SRR7642949	Ethanol	PRJNA484406	SAMN09764615	GSM3318028	NCY3290	AER
SRR7642950	YPD	PRJNA484406	SAMN09764614	GSM3318029	IL01	ANAE
SRR7642951	YPD	PRJNA484406	SAMN09764613	GSM3318030	IL01	ANAE
SRR7642952	Isobutanol	PRJNA484406	SAMN09764612	GSM3318031	IL01	AER
SRR7642953	Isobutanol	PRJNA484406	SAMN09764611	GSM3318032	Y7568	AER
SRR7642954	Isobutanol	PRJNA484406	SAMN09764610	GSM3318033	Y7568	AER
SRR7642955	Butanol	PRJNA484406	SAMN09764609	GSM3318034	NCY3290	AER
SRR7642956	Butanol	PRJNA484406	SAMN09764608	GSM3318035	NCY3290	AER
SRR7642957	Ethanol	PRJNA484406	SAMN09764607	GSM3318036	Y7568	AER
SRR7642958	Ethanol	PRJNA484406	SAMN09764606	GSM3318037	Y7568	AER
SRR7642959	Isobutanol	PRJNA484406	SAMN09764605	GSM3318038	NCY3290	AER
SRR7642960	Ethanol	PRJNA484406	SAMN09764604	GSM3318039	Y7568	ANAE
SRR7642961	Ethanol	PRJNA484406	SAMN09764603	GSM3318040	Y7568	ANAE
SRR7642962	Isobutanol	PRJNA484406	SAMN09764602	GSM3318041	NCY3290	AER
SRR7642963	Ethanol	PRJNA484406	SAMN09764601	GSM3318042	IL01	AER
SRR7642964	Ethanol	PRJNA484406	SAMN09764600	GSM3318043	IL01	AER
SRR7642965	Isobutanol	PRJNA484406	SAMN09764599	GSM3318044	IL01	AER
SRR7642966	Isobutanol	PRJNA484406	SAMN09764598	GSM3318045	Y7568	ANAE
SRR7642967	Isobutanol	PRJNA484406	SAMN09764597	GSM3318046	Y7568	ANAE
SRR7642968	Ethanol	PRJNA484406	SAMN09764596	GSM3318047	NCY3290	ANAE
SRR7642969	Ethanol	PRJNA484406	SAMN09764595	GSM3318048	NCY3290	ANAE
SRR7642970	Butanol	PRJNA484406	SAMN09764594	GSM3318049	Y7568	ANAE
SRR7642971	Ethanol	PRJNA484406	SAMN09764593	GSM3318050	IL01	ANAE
SRR7642972	Ethanol	PRJNA484406	SAMN09764592	GSM3318051	IL01	ANAE
SRR7642973	Butanol	PRJNA484406	SAMN09764591	GSM3318052	Y7568	ANAE
SRR7642974	YPD	PRJNA484406	SAMN09764622	GSM3318053	IL01	AER
SRR7642975	Butanol	PRJNA484406	SAMN09764621	GSM3318054	IL01	ANAE
SRR7642976	Butanol	PRJNA484406	SAMN09764620	GSM3318055	IL01	ANAE

In this experiment, multiple variables change from one sample to the other: the agent (YPD, Isobutanol, Butanol or Ethanol), the strain and the growth condition (aerobic vs anaerobic, column “oxy”).

Let’s add a column with the location of the abundance file (generated by Kallisto) for each sample. On my computer, I have all these files in a folder named “Abundances”. The file names are [Run].tsv

txp$file = paste("Abundances/", txp$Run, ".tsv", sep="")

kable(txp[1:4,])

Run	agent	BioProject	BioSample	SampleName	Strain	oxy	file
SRR7642929	YPD	PRJNA484406	SAMN09764590	GSM3318008	Y7568	AER	Abundances/SRR7642929.tsv
SRR7642930	YPD	PRJNA484406	SAMN09764636	GSM3318009	Y7568	AER	Abundances/SRR7642930.tsv
SRR7642931	YPD	PRJNA484406	SAMN09764635	GSM3318010	NCY3290	ANAE	Abundances/SRR7642931.tsv
SRR7642932	YPD	PRJNA484406	SAMN09764634	GSM3318011	NCY3290	ANAE	Abundances/SRR7642932.tsv

Let’s load all the packages needed for the analysis. These packages will allow us to query the ensembl database online and obtain the list of genes/transcripts from a given organism (in this case: the yeast Saccharomyces cerevisiae).

library("tximport")
library("DESeq2")
library("ggplot2")

library("ensembldb")
library(AnnotationHub)

Obtaining gene/transcript identifiers:

################################################################################
# We start by retrieving information from ensembl about gene/transcripts names:
# Our goal here is to create a two columns table which will link the transcripts 
# to their respective genes.

ah <- AnnotationHub()
#ahDb <- query(ah, pattern = c("Saccharomyces cerevisiae", "EnsDb", 103))
#ahDb
edb <- ah[["AH89534"]]

txs <- transcripts(edb, return.type = "DataFrame")
# Unfortunately, the transcript IDs returned here have an added "_mRNA" text at the end.
# Let's remove it:
txs$tID = gsub("_mRNA", "", txs$tx_id)
tl = txs[ , c("tID", "gene_id")]
colnames(tl) = c("TXNAME", "GENEID")

# To avoid doing this step again, we can save the "tl" data.frame (which contains 
# the relationships between transcripts and genes) in a text file (and simply 
# read this file next time we want to analyze the data, with maybe different 
# parameters):

write.table(tl, file="gene_transcript_link.tab", col.names=T, row.names=F, quote=F, sep="\t")

tx2gene = tl

# Next time, you can skip all this code above and simply use:

tx2gene = read.table("gene_transcript_link.tab", h=T, sep="\t")

Importing the data in R (using TXI and DESeq2):

# Now, let's use DESeq2 to analyze the data:

# The columns containing variables of interest need to be transformed into factors (see: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#note-on-factor-levels and https://www.stat.berkeley.edu/~s133/factors.html)
txp$agent = factor(txp$agent)
txp$Strain = factor(txp$Strain)
txp$oxy = factor(txp$oxy)

# Importing all the abundance calculations performed by Kallisto:
txi <- tximport(txp$file, type="kallisto", tx2gene=tx2gene)


# Creating the DESeq dataset:
# The most important part here is the "design" formula which tells DESeq2 that we are interested in contrasting aerobic vs anaerobic (column "oxy") but that it should know that two other variables (agent and Strain) can contribute to the variance in expression level. 
ddsTxi <- DESeqDataSetFromTximport(txi, colData = txp, design = ~ agent + Strain + oxy)

# These next 3 lines will create a matrix of transcript abundances which you can use to look at values in each sample
ddsc <- estimateSizeFactors(ddsTxi)
c = counts(ddsc, normalized=TRUE)
colnames(c) = paste(txp$Run, txp$oxy, txp$agent, txp$SampleName, sep="_")

# Let's remove genes that have low expression level throughout the experiment.
# The advantage of doing so is that fewer genes in the analysis = more statistical power during the correction for multiple testing.
# The cutoff at 200 is arbitrary.
MIN_ROW_SUM = 20
keep <- which( rowSums(counts(ddsTxi)) >= MIN_ROW_SUM )
length(keep)

## [1] 6234

ddsTxi <- ddsTxi[keep,]

# How many genes were removed by this step?

# Now, we let DESeq perform its magic...
dds <- DESeq(ddsTxi)
DESeq2res = results(dds, cooksCutoff = Inf)
DESeq2res

## log2 fold change (MLE): oxy ANAE vs AER 
## Wald test p-value: oxy ANAE vs AER 
## DataFrame with 6234 rows and 6 columns
##            baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##           <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Q0010       2.28137      -0.335161 0.3522456 -0.951498 3.41352e-01 3.83629e-01
## Q0045   15258.63531      -0.574666 0.0915521 -6.276931 3.45322e-10 7.95836e-10
## Q0050   10817.95191      -0.995330 0.3083073 -3.228370 1.24498e-03 1.88059e-03
## Q0055    3424.40498       0.712579 0.8187071  0.870372 3.84097e-01 4.26973e-01
## Q0060     218.43571      -0.194925 0.3618107 -0.538750 5.90060e-01 6.29330e-01
## ...             ...            ...       ...       ...         ...         ...
## YPR200C    10.60230      -0.239324  0.265586 -0.901119  0.36752520  0.41038010
## YPR201W    58.06971       0.380461  0.239714  1.587149  0.11247891  0.13754286
## YPR202W   190.41130       0.339468  0.113332  2.995358  0.00274123  0.00402829
## YPR203W     3.13892       0.579292  0.398659  1.453101  0.14619560  0.17545618
## YPR204W   121.82286       0.665697  0.498179  1.336259  0.18146449  0.21461765

Exercise:

How many genes have a significant differential expression between aerobic and anaerobic growth (at the alpha level of 0.05) before and after correction for multiple testing?

Modify the code to select only the samples with Butanol or Isobutanol and find the genes that are deferentially expressed between these two agents. Here is the result that you should obtain:

## log2 fold change (MLE): agent Isobutanol vs Butanol 
## Wald test p-value: agent Isobutanol vs Butanol 
## DataFrame with 5834 rows and 6 columns
##          baseMean log2FoldChange     lfcSE        stat    pvalue      padj
##         <numeric>      <numeric> <numeric>   <numeric> <numeric> <numeric>
## Q0045   15418.863     0.04153796  0.111022  0.37414119  0.708299  0.951690
## Q0050   12676.026     0.00125003  0.159697  0.00782754  0.993755  0.999187
## Q0055    3376.244    -0.08985234  0.280942 -0.31982549  0.749101  0.958376
## Q0060     264.849     0.24323798  0.191124  1.27267240  0.203134  0.669540
## Q0065    2297.199     0.25096109  0.171878  1.46011050  0.144260  0.573934
## ...           ...            ...       ...         ...       ...       ...
## YPR199C  198.6523      0.0389282 0.0900152    0.432462 0.6654056  0.939745
## YPR200C   12.6383      0.5795473 0.2553661    2.269476 0.0232394  0.216926
## YPR201W   75.7631      0.5022001 0.2185290    2.298093 0.0215565  0.206165
## YPR202W  173.8028      0.2622094 0.1452016    1.805830 0.0709448  0.401837
## YPR204W  113.7776     -0.3492797 0.6858166   -0.509290 0.6105488  0.915446

Visualizing data

We can look at the abundances for any given gene with the following code:

# Looking at gene YDR518W
geneID = "YDR518W"
plotCounts(dds, gene=geneID, intgroup="oxy")

# We can even split the data by aero/aner + agent to see if the increased/decreased expression level is specific to a certain agent
plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"))

# Or specific to a certain strain?
plotCounts(dds, gene=geneID, intgroup=c("oxy", "Strain"))

# Let's make these plots nicer with ggplot. For this, we will capture the data returned by "plotCounts" and then use ggplot to perform the actual ploting:
d <- plotCounts(dds, gene=geneID, intgroup=c("oxy"), returnData=TRUE)
ggplot(d, aes(x=oxy, y=count) ) + geom_point(position=position_jitter(w=0.1,h=0))

d <- plotCounts(dds, gene=geneID, intgroup=c("oxy", "agent"), returnData=TRUE)
ggplot(d, aes(x=interaction(oxy,agent), y=count, color=oxy, shape=agent)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Exercise:

The 2006 study found that only 5 out of 27 genes tested confer an advantage for anaerobic growth. Are these 5 genes significantly more highly expressed in anaerobic vs aerobic condition in this new experiment? The list of genes is: eug1, izh2, plb2, ylr413w and yor012w. You can convert the gene name into its corrsponding systematic ID by searching for it on: https://www.yeastgenome.org/

For example, the ID for eug1 in this experiment is: YDR518W

The main result from the 2006 study was that many genes with increased expression level under anaerobic conditions did not contribute to the fitness/survival of yeast when grown anaerobically. One possible explanation for this surprising result is that some of the genes in the list used in 2006 are not really over-expressed under anaerobic growth (false positives). One such gene is muc1. Does the new data confirm that muc1 is over-expressed under anaerobic growth? How about the gene ysr3?

Data quality control and clustering

This step can be extremely useful to spot problems with the data. For example, if a sample was incorrectly labeled or switched (these things happen more frequently than they should in the real world…). Here, we will perform a clustering of the samples to detect any problem.

library("pheatmap")
library("vsn")

# Normalizing the data
blind = F
vsd <- vst(dds, blind=blind)

# We will use the top 100 genes with the highest expression level to perform the clustering:
nbTopGenes = 100
select <- order(rowMeans(counts(dds,normalized=TRUE)), decreasing=TRUE)[1:nbTopGenes]
df <- as.data.frame(colData(dds)[,c("oxy","agent", "Strain")])
ph <- pheatmap(assay(vsd)[select,], cluster_rows=T, show_rownames=F, cluster_cols=T, annotation_col=df)
ph

In this heatmap, each row represent the expression pattern of a gene across all the samples (samples are in columns). Which variable (Strain, agent or oxy) has the most clustering/predictive power? Do you see any odd sample?

Gene ontology analysis.

Genes can be annotated based on their function. A massive effort to annotate the function of genes is the Gene Ontology project: http://geneontology.org/

For example, the gene eug1 (one of the 5 genes from the 2006 study) belongs to the functional category “protein disulfide isomerase activity” (GO:0006467), which is itself a sub-category of the more general “protein folding” category. See: http://amigo.geneontology.org/amigo/gene_product/SGD:S000002926

Gene ontology categories can be of three types: Molecular Function (MF), Biological Process (BP), or Cellular Component (CC). Our eug1 gene is also a member of GO:0005783 (endoplasmic reticulum) in the Cellular Component ontology.

Note that not all genes are assigned a functional category. For many genes, the function that they perform is still unknown.

It is often desirable to find if certain functional categories are over-represented among genes deferentially expressed in an experiment. For example here, we could be interested in finding which types of genes tend to be diferentially expressed in response to anaerobic growth. This type of analysis is called a GO enrichement analysis. Here is a short road map about how to perform this analysis:

library("topGO")
library("org.Sc.sgd.db")


# The main information needed for the GO enrichement analysis is a list of values summarizing the experiment for each gene (what is called the "gene universe")
# The value can be anything, but what is typically used is the p-value on the differential expression test (although the log fold change is also used sometimes)
# The goal is to create a vector with as many elements as there are genes in the analysis. The values inside the vector are the p-values and the names of the elements in the vector are the gene names.
# The following short function creates this vector for us:
getGeneUniverse <- function(deseqRes){
  vgs = as.numeric(deseqRes$pvalue)
  names(vgs) = rownames(deseqRes)
  vgs
}

# The sel_pval function is to be used for some enrichment analysis. Some tests use a hard cutt-off to generate list of genes deferentially expressed. In this case, the cutoff would be a p-value of 0.05.
# When using this strategy, the GO enrichment is done as follows:
# - Let's assume that the gene universe is 10,000 genes.
# - 2,000 of these genes are deferentially expressed (with p-value cutoff of 0.05)
# - 500 of the 10,000 genes are annotated as belonging to GO:00001 (a fictional Gene Ontology category)
# - 400 of the 2,000 belong to GO:00001
# Is GO:00001 over-represented among deferentially expressed genes?
# (The null expectation would be: 500*0.2 = 100)
# This can be tested with a simple Fisher test.
# Note that more sophisticated methods (Such as Kolmogorov–Smirnov) are also available and circumvent the use of arbitrary cutoffs
sel_pval <- function(allScore){ return(allScore < 0.05)}

geneUniverse = getGeneUniverse(DESeq2res)
  
dGO <- as.data.frame(org.Sc.sgdGO)
geneID2GO <- list()
for(geneID in unique(dGO$systematic_name)){
  geneID2GO[[geneID]] <- dGO$go_id[which(dGO$systematic_name==geneID)]
}

ontology <- "BP"
go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)


# Preparing the topGO analysis for the Biological Process (BP) ontology:
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,  
             allGenes = geneUniverse, geneSel = sel_pval, 
             nodeSize = 10, 
             annot = annFUN.GO2genes,
             GO2genes=go2genes)

allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "ks", algorithm = "elim")

allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))

kable(allRes[1:20,])

GO.ID	Term	Annotated	Significant	Expected	pval
GO:0002181	cytoplasmic translation	180	174	144.11	< 1e-30
GO:0000462	maturation of SSU-rRNA from tricistronic…	104	101	83.27	1.3e-13
GO:0000027	ribosomal large subunit assembly	43	43	34.43	2.5e-10
GO:0000463	maturation of LSU-rRNA from tricistronic…	52	48	41.63	3.5e-10
GO:0042273	ribosomal large subunit biogenesis	127	122	101.68	1.6e-08
GO:0000447	endonucleolytic cleavage in ITS1 to sepa…	44	43	35.23	4.4e-07
GO:0000480	endonucleolytic cleavage in 5’-ETS of tr…	33	32	26.42	5.2e-07
GO:0000472	endonucleolytic cleavage to generate mat…	34	33	27.22	8.9e-07
GO:0000466	maturation of 5.8S rRNA from tricistroni…	91	86	72.86	9.5e-07
GO:0006407	rRNA export from nucleus	16	16	12.81	1.9e-06
GO:0015986	proton motive force-driven ATP synthesis	18	16	14.41	2.2e-05
GO:0006364	rRNA processing	280	264	224.18	4.4e-05
GO:0000028	ribosomal small subunit assembly	26	24	20.82	4.6e-05
GO:0030488	tRNA methylation	24	22	19.22	5.5e-05
GO:0120009	intermembrane lipid transfer	21	20	16.81	5.9e-05
GO:0006450	regulation of translational fidelity	31	28	24.82	8.6e-05
GO:0048278	vesicle docking	34	32	27.22	0.00013
GO:0034727	piecemeal microautophagy of the nucleus	41	40	32.83	0.00018
GO:0001732	formation of cytoplasmic translation ini…	10	10	8.01	0.00020
GO:0043328	protein transport to vacuole involved in…	28	24	22.42	0.00022

Question:

Based on the name/description of these functional categories, can you identify some that are clearly related to aerobic vs anaerobic growth? Is the functional category cellular respiration significantly over/under represented among diferentially expressed genes?

Now looking at Cellular Components:

ontology <- "CC"

go2genes <- annFUN.gene2GO(whichOnto = ontology, gene2GO = geneID2GO)
GOdata = new("topGOdata", description="diff expr GO test", ontology= ontology,  
             allGenes = geneUniverse, geneSel = sel_pval, 
             nodeSize = 10, 
             annot = annFUN.GO2genes,
             GO2genes=go2genes)

allGO = usedGO(object = GOdata)
GOresult = runTest(GOdata, statistic = "KS", algorithm = "elim")

allRes <- GenTable(GOdata, pval = GOresult, topNodes = length(allGO))

kable(allRes[1:10,])

GO.ID	Term	Annotated	Significant	Expected	pval
GO:0030686	90S preribosome	90	87	72.12	< 1e-30
GO:0022625	cytosolic large ribosomal subunit	80	78	64.11	< 1e-30
GO:0005730	nucleolus	289	262	231.59	1.0e-15
GO:0022627	cytosolic small ribosomal subunit	59	55	47.28	1.9e-14
GO:0032040	small-subunit processome	53	53	42.47	3.5e-12
GO:0030687	preribosome, large subunit precursor	62	60	49.68	8.5e-11
GO:0000839	Hrd1p ubiquitin ligase ERAD-L complex	12	12	9.62	8.6e-07
GO:0030688	preribosome, small subunit precursor	23	23	18.43	2.3e-06
GO:0005737	cytoplasm	4349	3546	3485.11	3.0e-05
GO:0005736	RNA polymerase I complex	12	12	9.62	3.4e-05

Analyzing yeast RNAseq data

JF GOUT

Gene expression in budding yeast responding to alcohols with and without oxygen

Initial quantification step

Exercise: write a bash script to automatically run Kallisto on each sequencing file.

Analysis in R

Preparing the data:

Obtaining gene/transcript identifiers:

Importing the data in R (using TXI and DESeq2):

Exercise:

Visualizing data

Exercise:

Data quality control and clustering

Gene ontology analysis.

Question: