Searching for Open Reading Frames (ORFs) in DNA sequences

In this lab, we will learn how to perform a basic search for Open Reading Frames (ORFs) in a DNA sequence. We will start by writing our own ORF hunting function as a way to practice our algorithm invention skills and then move onto using available tools.

You will need to submit a lab report, in the form of an RMarkdown document with the following items:

A function that finds the size of an ORF starting from a given position in a sequence.
A function that finds all possible ORFs in a sequence.
The application of these function to a simple dataset and the identification of the gene in this dataset.
A simulation of the expected size of ORFs in random sequences.

A simple ORF hunting function

For this part, we’ll use a small DNA sequence which you can download from this file: sequence_ORF.fa

We’ll use the package Biostrings to easily import the DNA sequence:

library(Biostrings)

genome <- readDNAStringSet("data/sequence_ORF.fa")

genome

## DNAStringSet object of length 1:
##     width seq                                               names               
## [1]   321 TACTCCTCCGACTTACTCAGCGC...CGTTCTTCACGAATCATGCCAGC sequence

Our first task is to write a function (let’s call it findORFsize) which return the length of an ORF, defined as the number of nucleotides before encountering a stop codon in a sequence. If no stop codon is encountered before the end of the sequence, the function should return -1.

findORFsize(startPos = 1, seq = genome[[1]])

## [1] -1

findORFsize(startPos = 2, seq = genome[[1]])

## [1] 67

The second step consists in writing a function that searches for all possible ORFs in a sequence. To optimize things a bit, we will start by searching all the possible start codons with:

library(GenomicRanges)

startCodons <- matches <- matchPattern("ATG", genome[[1]])

startCodons

## Views on a 321-letter DNAString subject
## subject: TACTCCTCCGACTTACTCAGCGCCGCCAACGAAG...CCTTCAGCAGACGTTCTTCACGAATCATGCCAGC
## views:
##       start end width
##   [1]   227 229     3 [ATG]
##   [2]   314 316     3 [ATG]

We can now write the function myFindORFs wich searches for all the ORFs in a sequence (including the ones on the complementary strand…).

You will likely need to use the function reverseComplement for this.

library(tidyverse)

ORFs <- myFindORFs(seq = genome[[1]])
ORFs

## # A tibble: 4 × 4
##   start   end strand length
##   <dbl> <dbl> <chr>   <dbl>
## 1   227    -1 +          NA
## 2   314    -1 +          NA
## 3   315    13 -         302
## 4   246    13 -         233

Extract the DNA sequence corresponding to the ORF and translate it into protein with the translate function. Then, use the NCBI blast https://blast.ncbi.nlm.nih.gov/Blast.cgi to identify the protein.

Use the genome browser from ensembl (http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Info/Index/)[http://bacteria.ensembl.org/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Info/Index/] to identify the names of the genes directly adjacent (one upstream, one downstream) of the gene that you have just identified.

Application to an entire genome

Use the function findORFsFasta from the ORFik package to identify all the ORFs from the E. coli genome. You can find the entire genome assembly here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-56/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz
Pro tip: you do not need to download this file. You can read it into a DNAStringSet object directly with:

genome <- readDNAStringSet("https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-56/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz")

Then, the function findORFsFasta can take the DNAStringSet directly as an argument. Make sure to use the following parameters: start codon has to be “ATG” and minimum ORF size of 120 nucleotides (and no overlapping ORFs).

How many ORFs did you find in this genome? Does this number sound reasonable?

Simulating ORF sizes

In this section, we will use simulations to estimate how big an ORF can be just by chance. To do this, we will use the sample function to randomly scramble a DNA sequence. Use the DNA sequence that we’ve been working with in the first section (= the DNA sequence in which we’ve found an ORF of size 302).
To speed up things, you can use the function findORFs from the package ORFik. You will need to convert the DNAStringSet into an R string (character) to do so:

orfs <- findORFs(as.character(sample(genome[[1]])))

The general idea is to scramble the DNA sequence (using the code above) and record the size of the longest ORF found in this scrambled sequence. Repeat this 1,000 times and record the size of the longest ORF at each of the 1,000 iterations. After performing 1,000 random simulations, you can plot the distribution of ORF sizes obtained. Your graph should look something like this:

You can follow the instructions from https://jfgout.github.io/CMB8013/R/simulations.html to find how to perform this type of simulation.

Based on this simulation, what are the odds of finding an ORF of size 250 or more in the DNA sequence used here?

Going the extra mile

Representing ORFs in tracks

Follow the instruction of this lab: https://jfgout.github.io/CMB8013/R/genomics-1.html and generate a graph showing the structure of the first 10 ORFs found in the E. coli genome.

GC skew calculations

This part of the lab is not required for the report (= not graded). If you finish the first part of the lab quickly, you can explore this extra bit of coding.
In this extra section, we will explore a consequence of asymetries in DNA replication and repair mechanisms in bacteria: the GC skew. See here: https://pubmed.ncbi.nlm.nih.gov/8676740/ and here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2684130/

Write a function which computes the GC skew in a sliding window manner (the window size will be a parameter, you can try with 10,000 nt for example, the sliding will also be a parameter, with 1/3rd of the window size being a good starting point).
The GC skew is defined as: (G-C)/(G+C) where G and C represent the number of guanines and cytosines respectively.
You can then compute the cumulative GC skew, which is the sum of the GC skew from all the previous windows. A plot of the cumulative GC skew as a function of the location alongside the chromosome assembly should look like this: