In this lab, we will learn how to perform a basic search for Open
Reading Frames (ORFs) in a DNA sequence. We will start by writing our
own ORF hunting function as a way to practice our algorithm invention
skills and then move onto using available tools.
You will need to submit a lab report, in the form of an RMarkdown
document with the following items:
A function that finds the size of an ORF starting from a given position in a sequence.
A function that finds all possible ORFs in a sequence.
The application of these function to a simple dataset and the identification of the gene in this dataset.
A simulation of the expected size of ORFs in random sequences.
For this part, we’ll use a small DNA sequence which you can
download from this file: sequence_ORF.fa
We’ll use the package Biostrings to easily
import the DNA sequence:
library(Biostrings)
genome <- readDNAStringSet("data/sequence_ORF.fa")
genome
## DNAStringSet object of length 1:
## width seq names
## [1] 321 TACTCCTCCGACTTACTCAGCGC...CGTTCTTCACGAATCATGCCAGC sequence
Our first task is to write a function (let’s call it
findORFsize) which return the length of an ORF, defined
as the number of nucleotides before encountering a stop codon in a
sequence. If no stop codon is encountered before the end of the
sequence, the function should return -1.
findORFsize(startPos = 1, seq = genome[[1]])
## [1] -1
findORFsize(startPos = 2, seq = genome[[1]])
## [1] 67
The second step consists in writing a function that searches for
all possible ORFs in a sequence. To optimize things a bit, we will start
by searching all the possible start codons with:
library(GenomicRanges)
startCodons <- matches <- matchPattern("ATG", genome[[1]])
startCodons
## Views on a 321-letter DNAString subject
## subject: TACTCCTCCGACTTACTCAGCGCCGCCAACGAAG...CCTTCAGCAGACGTTCTTCACGAATCATGCCAGC
## views:
## start end width
## [1] 227 229 3 [ATG]
## [2] 314 316 3 [ATG]
We can now write the function myFindORFs wich
searches for all the ORFs in a sequence (including the ones on the
complementary strand…).
You will likely need to use the function reverseComplement for this.
library(tidyverse)
ORFs <- myFindORFs(seq = genome[[1]])
ORFs
## # A tibble: 4 × 4
## start end strand length
## <dbl> <dbl> <chr> <dbl>
## 1 227 -1 + NA
## 2 314 -1 + NA
## 3 315 13 - 302
## 4 246 13 - 233
Extract the DNA sequence corresponding to the ORF and translate
it into protein with the translate function. Then, use
the NCBI blast https://blast.ncbi.nlm.nih.gov/Blast.cgi
to identify the protein.
Use the genome browser from ensembl (http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Info/Index/)[http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Info/Index/]
to identify the names of the genes directly adjacent (one upstream, one
downstream) of the gene that you have just identified.
Use the function findORFsFasta from the
ORFik package to identify all the ORFs from the E.
coli genome. You can find the entire genome assembly here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-56/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz
Pro tip: you do not need to download this file. You can read it
into a DNAStringSet object directly with:
genome <- readDNAStringSet("https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-56/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz")
Then, the function findORFsFasta can take the DNAStringSet directly as an argument. Make sure to use the following parameters: start codon has to be “ATG” and minimum ORF size of 120 nucleotides (and no overlapping ORFs).
How many ORFs did you find in this genome? Does this number
sound reasonable?
In this section, we will use simulations to estimate how big an
ORF can be just by chance. To do this, we will use the
sample function to randomly scramble a DNA sequence.
Use the DNA sequence that we’ve been working with in the first section
(= the DNA sequence in which we’ve found an ORF of size 302).
To
speed up things, you can use the function findORFs from
the package ORFik. You will need to convert the
DNAStringSet into an R string (character) to do so:
orfs <- findORFs(as.character(sample(genome[[1]])))
The general idea is to scramble the DNA sequence (using the code above) and record the size of the longest ORF found in this scrambled sequence. Repeat this 1,000 times and record the size of the longest ORF at each of the 1,000 iterations. After performing 1,000 random simulations, you can plot the distribution of ORF sizes obtained. Your graph should look something like this:
You can follow the instructions from https://jfgout.github.io/CMB8013/R/simulations.html
to find how to perform this type of simulation.
Based on this simulation, what are the odds of finding an ORF of
size 250 or more in the DNA sequence used here?
Follow the instruction of this lab: https://jfgout.github.io/CMB8013/R/genomics-1.html and generate a graph showing the structure of the first 10 ORFs found in the E. coli genome.
This part of the lab is not required for the report (= not graded).
If you finish the first part of the lab quickly, you can explore this
extra bit of coding.
In this extra section, we will explore a
consequence of asymetries in DNA replication and repair mechanisms in
bacteria: the GC skew. See here: https://pubmed.ncbi.nlm.nih.gov/8676740/
and here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2684130/
Write a function which computes the GC skew in a sliding window
manner (the window size will be a parameter, you can try with 10,000 nt
for example, the sliding will also be a parameter, with 1/3rd of the
window size being a good starting point).
The GC skew is defined as: (G-C)/(G+C) where G and C
represent the number of guanines and cytosines respectively.
You can then compute the cumulative GC skew, which is the sum of the GC
skew from all the previous windows. A plot of the cumulative GC skew as
a function of the location alongside the chromosome assembly should look
like this: