The final project consists in multiple steps covering many of
the different skills you have learned in this class (R, programming,
using a high-performance computing system, …). You will need to provide
a report in the form of an R Markdown document (both source and PDF
output) to canvas. This report should contain all the source code (both
R and bash) that you used to accomplish the project as well as the
results and their discussion (don’t forget to answer the questions asked
throughout the description of the project).
Wednesday, May 10th noon-3pm (the time reserved for Final Exam) will be a time available for you to work on this final project from class and I will be available during that time to answer questions.
The deadline to upload the final report is Friday May 12th at midnight (Central Time).
The file /mnt/scratch/goutfs/CMB8013/FinalProject/reads/SRR11278168.fastq on LUGH contains unpaired reads generated to sequence the genome of an organism. Use the Spades assembler to generate a genome assembly for this organism. Make sure to include the code used in your R Markdown report. You can include code in R Markdown with the following syntax:
```{bash, eval = F}
some code
```
Example:
#!/bin/bash
cd myFolder
module load myModule
spades.py --some-argument --another-argument
After you have assembled the genome, answer the following questions (make sure to explain how you reached your answers):
1.1. What is the organism that was sequenced?
1.2. What are the total size, number of scaffolds and the N50 of the assembly? Is this a good or bad N50 value?
1.3. In your opinion, would it be useful to pursue a different strategy (paired-end sequencing, long reads technology, deeper sequencing, …) to improve the quality of this assembly?
You will now search for Open Reading Frame (ORFs) in the genome
assembly. To make sure that everyone uses the same starting point, use
the solution provided here:
/mnt/scratch/goutfs/CMB8013/FinalProject/misc/scaffolds-rc.fasta
This is the reverse-complement of the assembly file which you should
have obtained in the previous step.
Use the method described in class (R, package ORFik) to search for
ORFs on the plus strand only (keep only ORFs that start with an ATG and
are at least 120nt long).
How many ORFs did you find? Does this sound reasonable? (based on
what information you could gather about the organism being sequenced)
In preparation for the future steps, this would be a good time to
perform the following step: Extract the sequence of each ORF
into a file named ORFs.fna
Hint: you will need to use the functions getSeq
(from Biostrings) and export (from rtracklayer)
Make sure to name your ORFs (ideally: ORF1, ORF2, ORF3, …)
The file
/mnt/scratch/goutfs/CMB8013/FinalProject/misc/mystery.faa
contains the sequence of a protein. Your goal is to find the location of
the sequence in the genome assembly that codes for this protein (without
using the results from the previous step).
Does this result correspond to one of the ORFs found in the
previous step?
Use Kallisto to obtain transcript abundance estimate
for the ORFs that you have annotated in step #2. You will use RNAseq
data from this file:
/mnt/scratch/goutfs/CMB8013/FinalProject/reads/RNAseq.fastq
In prevision for the next substep, you will need to make sure that
kallisto also generates a pseudomapping file (.bam file) with
coordinates relative to the genome assembly. Because this step requires
a very specific file format, I am providing the GTF file to use in this
step here:
/mnt/scratch/goutfs/CMB8013/FinalProject/misc/orfs.gtf
Which ORFs have the highest and lowest expression levels?
Use the strategy described here: (https://jfgout.github.io/CMB8013/R/genomics-1.html)[https://jfgout.github.io/CMB8013/R/genomics-1.html] to
generate an image representing the genome with the annotated ORFs and
the coverage from the RNAseq data mapped by Kallisto. This is what the
resuklt should look like:
Dr. Leavitt is investigating the transcriptomic response of
Paramecium tetraurelia to changes in environmental conditions.
She finds that, out of 40,000 protein-coding genes, 2,000 genes are
differentially expressed upon changing the medium temperature from 20C
to 30C. Dr. Leavitt wants to test if genes annotated as “Heat Shock
Protein” are over-represented among genes differentially expressed in
this experiment. She finds that 70 of the 1,200 annotated “Heat Shock
Proteins” are differentially expressed.
Use basic R simulations such as described in https://jfgout.github.io/CMB8013/R/simulations.html
to test if there is a significant enrichement in Heat-Shock
proteins.