Genomics with R - Visualizing tracks

Prelude.

For this lab, I will provide a detailed guide on how to perform the different steps and obtain all the figures presented in this page. However, some of the code is left for you to figure out. You can check that your code works by comparing the results that you will obtain to those presented in this page (all the results will be shown below).

Your lab report should consist of the full set of code required to reproduce the full set of results/figures presented on this page. As usual, generate a report using RMarkdown and html output which you then convert into a PDF file submitted via canvas.

Make sure to answer (in your lab report) all the questions asked in this page!

Introduction.

In this lab, you will learn how to visualize data in a similar way to what you can see when using a genome browser (see: https://github.com/jfgout/AppliedGenomics/wiki/Lab-04:-Genome-browsers-and-General-Feature-Format)

We will use the package Gviz, for which a detailed documentation can be found at: https://www.bioconductor.org/packages/devel/bioc/vignettes/Gviz/inst/doc/Gviz.html

This package (and the three over packages we’ll need for this lab) are part of a larger initiative named Bioconductor which aims at providing tools for analysis of genomics data in R. See: http://bioconductor.org/

The resources offered by Bioconductor are impressive, with almost 2,000 packages available as of February 2021! Sometimes, the most difficult task is to find the best package for the specific task you want to accomplish…

Installing the required packages.

Packages from Bioconductor are installed via a slightly different system than regular packages. You need to install Bioconductor first, and then use Bioconductor’s own package managing system to install these packages. But don’t worry, they made it easy for us.

First, install Bioconductor’s package manager with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Note the if test which will call the install command only if it is required (= not already installed on your system). Because there is only one command executed in case the if statement is true, there are no brackets { and }
It would have been exactly equivalent to write it this way:

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
}

Or this way:

if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") }

Now that the bioconductor manager is installed, we can use it to install the packages required for this lab (note: this step can take a bit of time and R will probably ask you if you want to update some packages already installed. When asked the question, you can type a to say that you want to update all the packages)

BiocManager::install(c("GenomicFeatures", "rtracklayer", "Gviz"))

Note the syntax BiocManager::install which makes sure to call the function install which is defined in the BiocManager package. Because install is a very generic term, it is not impossible that there would be a different function named install in a different package. This notation makes sure that you are calling the function from the BiocManager package.

Preparations

Navigate the http://www.ensembl.org website and find the GFF annotation file for chromosome V of Caenorhabditis elegans (ensembl offers the choice of files in GTF or GFF3 format (two versions of the same thing, with only minor differences) - for this lab, we’ll use the GFF3 version).

Note: the file that you will download is compressed with gzip. You can un-compress (extract) it if you want to open it with a text editor and look at its content. Otherwise, you can keep the compressed version. The R function which we’ll be using are capable of reading data directly from a gzip-compressed file.

Getting started by loading the required packages and setting the work directory.

library(rtracklayer)
library(GenomicFeatures)
library(Gviz)

setwd("C:/lab/MSSTATE/Class/CMB8013/R")

The packages used in today’s lab are optimized to handle files provided by the UCSC Genome Browser ( https://genome.ucsc.edu/ ). Since we are using annotation files from a different source (ensembl), we need to “tell” the packages that some chromosome names will not conform to the UCSC naming conventions (i.e. the chromosome names might not start with “chr”). For example, you might find files using “1” for the name of chromosome #1 in human, while other files will use the name: “chr1” or even “chr_1” …

In the absence of a central authority regulating file formats/naming conventions, things get messy quickly and the multiplications of formats is the cause of many headaches for people working with genomics data…

The magic formula to tell the packages to ignore the “bad” chromosome names is:

options(ucscChromosomeNames=FALSE)

To load the GFF3 annotation file in R, we will use the function makeTxDbFromGFF (from the package GenomicFeatures). This function can read a GFF/GTF file and return a TxDb object which you can think of as an enhanced data.frame with information about transcripts and their location in a genome.

Exercise: Look up the documentation for makeTxDbFromGFF and find the correct syntax to load the content of Caenorhabditis_elegans.WBcel235.102.chromosome.V.gff3.gz into an object which you will name txdb

# Some R code to load the content of **Caenorhabditis_elegans.WBcel235.102.chromosome.V.gff3.gz** into an object named txdb 

# txdb = makeTxDbFromGFF( [The correct arguments for this function] )

You will see some error messages during the load of the GFF file, do not panic:

## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .extract_exons_from_GRanges(exon_IDX, gr, mcols0, tx_IDX, feature = "exon", : 230 exons couldn't be linked to a transcript so were dropped (showing
##   only the first 6):
##   seqid  start    end strand   ID        Name              Parent Parent_type
## 1     V  31881  31901      + <NA>  B0348.7.e1  transcript:B0348.7        <NA>
## 2     V 102474 102494      + <NA> F56E10.5.e1 transcript:F56E10.5        <NA>
## 3     V 144805 144825      + <NA> W03F9.12.e1 transcript:W03F9.12        <NA>
## 4     V 436982 437002      - <NA>  B0554.8.e1  transcript:B0554.8        <NA>
## 5     V 482048 482068      + <NA> T01G6.11.e1 transcript:T01G6.11        <NA>
## 6     V 586522 586542      - <NA>  K06H6.7.e1  transcript:K06H6.7        <NA>
## OK
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: ./data/Genomics/Caenorhabditis_elegans.WBcel235.102.chromosome.V.gff3
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 10439
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-02-11 13:18:04 -0600 (Thu, 11 Feb 2021)
## # GenomicFeatures version at creation time: 1.42.1
## # RSQLite version at creation time: 2.2.3
## # DBSCHEMAVERSION: 1.2

Question:

Look up some of the exons for which makeTxDbFromGFF displayed an error message (using ensembl or directly in the GFF file using a text editor) and compare them to a random sample of genes for which there was no error while loading the data. You can obtain a small list of such genes to look at with the following command:

genes(txdb, columns=c("TXID", "TXNAME", "GENEID"))$GENEID[8:12]
## CharacterList of length 5
## [["B0024.11"]] B0024.11
## [["B0024.13"]] B0024.13
## [["B0024.15"]] B0024.15
## [["B0024.3"]] B0024.3
## [["B0024.4"]] B0024.4

What would be your hypothesis as to why some of these exons caused an error message? In other words, what seems to be special about these exons?

Generating and visualizing annotation tracks

Now that we have loaded the annotations in a TxDb object, we need to transform it into a format that can be displayed as a track by Gviz. For this, we will use the function GeneRegionTrack

To prevent overloading your computer (and to keep the figure usable), we will look only at the first 10,000 nucleotides (10kb) of chromosome V.

Save the result of the GeneRegionTrack function into an object named grt (for Genomic Region Track)

# grt <- GeneRegionTrack( [Correct arguments] )
grt
## GeneRegionTrack 'GeneRegionTrack'
## | genome: NA
## | active chromosome: V
## | annotation features: 49
plotTracks(grt)

One useful feature is to add a track showing the genomic position at the top of the plot. This is done with the following code:

# Preparing the position track:
axisTrack <- GenomeAxisTrack()

# Displaying the two tracks (axisTrack for the position and grt for the features) + Adding symbols on the track:
plotTracks( list(axisTrack, grt), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol")

Adding your own data.

Let’s assume that you have some data showing that regions 2,500 to 2,800 and 4,500 to 5,600 (on chromosome V) are of particular interest (for example, you find this sequence to evolve faster than the rest of the genome). We would like to display a track showing the location of these two regions relative to the annotation.
To do so, we will start by constructing a GRange object, which is the standard way of storing genomic ranges (= locations) in R:

myGr = GRanges(seqnames=rep("V", 2),
           ranges=IRanges(start=c(2500, 4500),
                          end=c(2800, 5600)),
           names = c("region-1", "region-2")
        )

We then convert the GRange object into a track and plot it:

myTrack <- AnnotationTrack(myGr, id=myGr$names, name="Fast evolving regions")

plotTracks( list(axisTrack, myTrack, grt), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol")

Exercise: Add a track showing the location of mutations (single-nucleotide ploymorphisms) that have been reported at the following locations (still on chromosome V): 2,500 ; 3,850 ; 6,237

The result should look like this:

And here is a recipe to display some values (could be sequence conservation score, read coverage, …) alongside a track:

set.seed(255) # <- This is necessary to use ransom number generator

NB_POINTS = 200 # <- We will randomly generate 200 values

lim = c(start=1, end=10e3) # <- The location of these values will be between 1 and 10kb (on chromosome V)

# These next two lines of code randomly draw 200 positions between 1 and 10kb (and makes sure that they are in increasing order):
v = seq(from=lim["start"], to=lim["end"], by=1)
vs = sample(v)[1:NB_POINTS]
vCoords = vs[order(vs, decreasing=F)]


dat <- runif(NB_POINTS, min = 0, max = 10) # <- Generating the 200 random values

# Making a track out of this data:
dtrack <- DataTrack(data = dat, start = vCoords, end = vCoords, chromosome = chr, genome = gen, name = "Random values")

# And displaying the track:
plotTracks( list(axisTrack, grt, dtrack), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol")

Same thing, but displaying the data as a histogram instead of points:

plotTracks( list(axisTrack, grt, dtrack), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol", type="histogram")

And one more style of displaying the data:

plotTracks( list(axisTrack, grt, dtrack), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol", type="polygon")

Displaying a track of mapped reads.

In this last section of the lab, we will visualize data from an RNAseq experiment. In preparation for this lab, I downloaded an RNAseq data set from an experiment that was investigating the genes up/down regulated in response to heat-shock in C. elegans.
I mapped all the reads from the RNAseq library prepared from nematodes grown at 15C and extracted the subset that mapped within the first 10 kb of chromosome V (working on a small subset of the data prevents you from having to handle large files which might bring your laptop to its knees…).

Download these two files and save them in the same folder on your laptop:

Note: bam files are the binary version of sam files. If you try to open them with a text editor, the content will look like a bunch of strange characters:

.

If you are curious and want to see what the actual data looks like, you can download the sam version of this file here: mini.sam

Load the alignment track with the following code:

alnt <- AlignmentsTrack("./data/Genomics/celegans-heat-shock/mini.bam")

Now, we can plot our tracks, focusing on the region between 6 and 8kb:

plotTracks( list(axisTrack, grt, alnt), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol")

We can zoom in on this region: 7.35 to 7.45kb:

plotTracks( list(axisTrack, grt, alnt), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol")

Version with coverage graph only, not individual reads:

plotTracks( list(axisTrack, grt, alnt), from = lim["start"], to = lim["end"], chromosome = chr, transcriptAnnotation = "symbol", type="coverage")