Final project for CMB 8013

The final project consists in multiple steps covering many of the different skills you have learned in this class (R, programming, using a high-performance computing system, …). You will need to provide a report in the form of an R Markdown document (both source and PDF output) to canvas. This report should contain all the source code (both R and bash) that you used to accomplish the project as well as the results and their discussion (don’t forget to answer the questions asked throughout the description of the project).

Wednesday, May 10th noon-3pm (the time reserved for Final Exam) will be a time available for you to work on this final project from class and I will be available during that time to answer questions.

The deadline to upload the final report is Friday May 12th at midnight (Central Time).

Part 1: Genome assembly

The file /mnt/scratch/goutfs/CMB8013/FinalProject/reads/SRR11278168.fastq on LUGH contains unpaired reads generated to sequence the genome of an organism. Use the Spades assembler to generate a genome assembly for this organism. Make sure to include the code used in your R Markdown report. You can include code in R Markdown with the following syntax:

```{bash, eval = F}

some code

```

Example:

#!/bin/bash

cd myFolder

module load myModule

spades.py --some-argument --another-argument

After you have assembled the genome, answer the following questions (make sure to explain how you reached your answers):

1.1. What is the organism that was sequenced?
1.2. What are the total size, number of scaffolds and the N50 of the assembly? Is this a good or bad N50 value?
1.3. In your opinion, would it be useful to pursue a different strategy (paired-end sequencing, long reads technology, deeper sequencing, …) to improve the quality of this assembly?

2. Open Reading Frame Hunting

You will now search for Open Reading Frame (ORFs) in the genome assembly. To make sure that everyone uses the same starting point, use the solution provided here: /mnt/scratch/goutfs/CMB8013/FinalProject/misc/scaffolds-rc.fasta This is the reverse-complement of the assembly file which you should have obtained in the previous step.

Use the method described in class (R, package ORFik) to search for ORFs on the plus strand only (keep only ORFs that start with an ATG and are at least 120nt long).

How many ORFs did you find? Does this sound reasonable? (based on what information you could gather about the organism being sequenced)
In preparation for the future steps, this would be a good time to perform the following step: Extract the sequence of each ORF into a file named ORFs.fna

Hint: you will need to use the functions getSeq (from Biostrings) and export (from rtracklayer)

Make sure to name your ORFs (ideally: ORF1, ORF2, ORF3, …)

3. Finding a specific gene

The file /mnt/scratch/goutfs/CMB8013/FinalProject/misc/mystery.faa contains the sequence of a protein. Your goal is to find the location of the sequence in the genome assembly that codes for this protein (without using the results from the previous step).

Does this result correspond to one of the ORFs found in the previous step?

4. Gene Expression

Use Kallisto to obtain transcript abundance estimate for the ORFs that you have annotated in step #2. You will use RNAseq data from this file: /mnt/scratch/goutfs/CMB8013/FinalProject/reads/RNAseq.fastq
In prevision for the next substep, you will need to make sure that kallisto also generates a pseudomapping file (.bam file) with coordinates relative to the genome assembly. Because this step requires a very specific file format, I am providing the GTF file to use in this step here: /mnt/scratch/goutfs/CMB8013/FinalProject/misc/orfs.gtf

Which ORFs have the highest and lowest expression levels?

Use the strategy described here: (https://jfgout.github.io/CMB8013/R/genomics-1.html)[https://jfgout.github.io/CMB8013/R/genomics-1.html] to generate an image representing the genome with the annotated ORFs and the coverage from the RNAseq data mapped by Kallisto. This is what the resuklt should look like:

5. Simulations

Dr. Leavitt is investigating the transcriptomic response of Paramecium tetraurelia to changes in environmental conditions. She finds that, out of 40,000 protein-coding genes, 2,000 genes are differentially expressed upon changing the medium temperature from 20C to 30C. Dr. Leavitt wants to test if genes annotated as “Heat Shock Protein” are over-represented among genes differentially expressed in this experiment. She finds that 70 of the 1,200 annotated “Heat Shock Proteins” are differentially expressed.

Use basic R simulations such as described in https://jfgout.github.io/CMB8013/R/simulations.html to test if there is a significant enrichement in Heat-Shock proteins.