Tuesday, November 12, 2013

Difference between Cuffcompare and Cuffmerge



Cufflinks is one of the most commonly used programs for reference-genome-based transcriptome assembly, expression estimation, and differential expression analysis. Cufflinks comes with two supplementary programs (in addition to a few others) for the post-Cufflinks workflow: Cuffcompare and Cuffmerge.

Although Cuffcompare and Cuffmerge may seem to perform the same task when it comes to handling multiple transcriptome assemblies, there are still substantial differences between the two. The Cufflinks manual, the Tuxedo pipeline (Bowtie-TopHat-Cufflinks-Cuffdiff) paper, and forum posts from the developers have pointed out these differences, but I still get a lot of questions about them. I will try to explain the differences to make them clearer.

Cuffcompare and Cuffmerge are both used to merge multiple transcript assemblies, but in slightly different ways.
Cuffcompare takes all the transcripts from multiple assemblies (in GTF format) and creates a union of all the transcripts in which redundant transcripts are removed. Cuffcompare does not change any of the assembled transcripts in any assembly; it simply compares the coordinates of the transcripts.
The resulting file, "combined.gtf", contains a set of "unified" transcripts across all the assemblies. The "combined.gtf" file can be used as the reference GTF file for quantification across samples using Cuffdiff (another program in the Cufflinks toolkit).
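A typical Cuffcompare invocation looks something like the line below (a sketch; the per-sample transcripts.gtf paths, the annotation.gtf reference file, and the "cuffcmp" output prefix are placeholders for your own files):

cuffcompare -r annotation.gtf -o cuffcmp sample1/transcripts.gtf sample2/transcripts.gtf sample3/transcripts.gtf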

Cuffmerge, on the other hand, creates a "merged" set of transcripts from multiple assemblies. During this merging, transcripts from all the assemblies (GTF files) are converted to representative reads in SAM format, and Cufflinks (the original assembly program) is run internally to see if there are any gaps that can be filled so that a longer consensus transcript can be created. Basically, Cuffmerge merges transcripts that overlap and share a similar exon (or splicing) structure to generate a longer chain of connected exons.
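Cuffmerge is run on a text file that lists the per-sample assembly GTFs, one path per line. Something like the following should work (a sketch; assembly_list.txt, annotation.gtf, and genome.fa are placeholder names for your own files):

cuffmerge -g annotation.gtf -s genome.fa -p 8 assembly_list.txt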


Overall, Cuffcompare will generate a non-redundant set of transcripts, while Cuffmerge will generate a more consensus-like assembly from a set of multiple assemblies. So from Cuffmerge you get a cleaner, somewhat more complete assembly and, generally, fewer assembled transcripts compared to the transcripts from Cuffcompare.

Additional note:
Cuffcompare is a more comprehensive program than simply a tool to combine assemblies. For example, the ".tracking" file generated by Cuffcompare records in how many samples each transcript was present, so you can get an idea of the multiplicity (recurrence) of each transcript across multiple samples. Cuffcompare can also annotate your transcript assemblies using a reference annotation file (in GTF format) and will assign reference transcript IDs (such as Ensembl IDs) and gene symbols to the assembled transcripts.



Wednesday, October 2, 2013

FPKM/RPKM normalization caveat and upper quartile normalization


FPKM (fragments per kilobase of exon per million mapped fragments) or RPKM (reads per kilobase of exon per million mapped reads) has been one of the most widely used normalization methods for estimating transcript abundance from RNA-Seq tag counts (numbers of reads), where tag counts are normalized by the transcript length (exons only) as well as by the total number of mappable reads in the sequencing library. However, recent studies have pointed out that this normalization method suffers from an inherent bias when very highly expressed transcripts are present.

For example, suppose there are two samples in a study, S1 and S2, and S1 has several highly abundant (highly expressed) transcripts. By the nature of the RNA-Seq protocol, highly expressed transcripts will generate a large number of reads, and as a result S1 will have a larger total number of reads than S2.

Now suppose there is a transcript, let's say T, that is expressed at identical levels in both S1 and S2. If we compare the FPKM or RPKM of T between S1 and S2, T will appear to be down-regulated in S1, because S1 has a higher total read count and normalizing by that larger total decreases the expression value of T by a large factor.

So normalization by total number of reads can sometimes lead to false detection of differentially expressed genes or transcripts in quantitative comparison studies based on RNA-Seq.

One of the easiest solutions for removing such bias is upper quartile normalization, which can be performed in a few simple steps:

1.) Create a matrix (in an Excel sheet or a text file) where rows represent genes or transcripts and columns represent the different samples. Each cell contains the expression value (read count or any other expression metric) of one gene or transcript in one sample from the study.
2.) Remove the genes/transcripts that have an expression value of 0 in all the samples.

3.) Now for each column (i.e. for each sample):
     a. Sort the expression values in increasing order and find the 75th percentile value (upper quartile).
     b. Divide all the expression values by the upper quartile value

4.) This step is optional:
      Dividing the original expression values by a normalization factor will result in very small values that may be inconvenient for downstream calculations. So, in order to scale up the normalized expression values, multiply all the expression values in the matrix by the mean of the upper quartiles from all the samples.

I have summarized the above steps in the formula below:

Nnorm_ij = ( N_ij / D_i ) x ( (D_1 + D_2 + ... + D_n) / n )

Here:
j: index of the gene or transcript (rows in the matrix).
i: index of the sample, varying from 1 to n, where n is the total number of samples in the study (columns in the matrix).
Nnorm_ij: normalized and scaled-up expression value of the jth gene or transcript in the ith sample.

N_ij: original expression value (raw read count) of the jth gene or transcript in the ith sample.

D_i: upper quartile expression value for the ith sample.
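
For readers who prefer a script over a spreadsheet, below is a minimal Perl sketch of the same procedure (the input file name counts.txt, its tab-delimited layout with a header row of sample names and a gene ID in the first column, and the simple 75th-percentile index are all assumptions; adapt them to your own matrix):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

## read the expression matrix: header row of sample names, then one row per gene
open my $fh, '<', 'counts.txt' or die "Cannot open counts.txt: $!";
my $header = <$fh>;
chomp $header;
my @samples = split /\t/, $header;
shift @samples;                        ## drop the gene-id column label

my (@ids, @rows);
while (<$fh>) {
    chomp;
    my ($id, @counts) = split /\t/;
    next if sum(@counts) == 0;         ## step 2: drop genes with 0 in every sample
    push @ids,  $id;
    push @rows, \@counts;
}
close $fh;

## step 3a: 75th percentile (upper quartile) of each sample's expression values
my @uq;
for my $i (0 .. $#samples) {
    my @col = sort { $a <=> $b } map { $_->[$i] } @rows;
    $uq[$i] = $col[ int(0.75 * $#col) ];
    $uq[$i] = 1 if $uq[$i] == 0;       ## guard against a degenerate sample
}
my $mean_uq = sum(@uq) / @uq;          ## step 4: scaling factor (mean of upper quartiles)

## steps 3b and 4: divide by the sample's upper quartile, then rescale
for my $j (0 .. $#ids) {
    my @norm = map { $rows[$j][$_] / $uq[$_] * $mean_uq } 0 .. $#samples;
    print join("\t", $ids[$j], map { sprintf("%.3f", $_) } @norm), "\n";
}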


Various normalization methods available for the RNA-Seq data are discussed in detail in 
http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long

If you are using Cuffdiff  to estimate FPKM/RPKM then you can also perform upper quartile normalization internally by using the argument "--upper-quartile-norm".
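For example, a Cuffdiff run with upper quartile normalization enabled could look like this (a sketch; merged.gtf and the BAM file names are placeholders for your own files):

cuffdiff --upper-quartile-norm -o cuffdiff_out merged.gtf sample1.bam sample2.bam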

Sunday, September 29, 2013

How much memory does TopHat require?


I've been using the TopHat program to align RNA-Seq reads to the reference genome. I mainly run my jobs on computing clusters, where the estimated resource usage (such as memory and CPU time) has to be specified before submitting a job.
In the beginning, most of my TopHat jobs would fail because they exceeded the requested memory (RAM), especially when there were hundreds of millions of reads in the FASTQ files. After looking into the TopHat run logs, it was clear that TopHat uses extensive memory in the "joining segment hits" step, where the split alignments of a read are 'stitched' together to generate a contiguous alignment. In order to determine the memory usage, I ran TopHat multiple times on a sample RNA-Seq dataset (from the human transcriptome) with a varying number of reads in each run. Following are the details of the TopHat runs:

Reference genome: hg19 (GRCh37)
Read length: 50 (paired, insert length: 250, standard-deviation: 30)
Reference transcriptome: Ensembl annotations (from UCSC)
Number of threads (cpus) assigned: 12
Fusion search: On (to get the maximum memory that Tophat may use)

The rest of the TopHat parameters were kept at their default values.
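Put together, each run's command line looked roughly like the one below (a sketch; the index name hg19_index, the annotation file genes.gtf, and the FASTQ file names are placeholders, and the mate inner distance of 150 is derived from the 250 bp insert minus the two 50 bp reads):

tophat2 -p 12 --fusion-search -G genes.gtf -r 150 --mate-std-dev 30 -o tophat_out hg19_index reads_1.fastq reads_2.fastq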

I plotted the amount of memory used against the number of reads in the dataset (figure below).

[Figure: TopHat memory usage versus number of input reads]

As expected, memory usage increases (almost linearly) with the number of reads in the data. TopHat needs at least 4 GB of RAM to run your data against the human genome, since it loads the entire genome index into RAM before starting the alignment. For large datasets, such as 200 million reads, memory usage can easily exceed 30 GB (so it is not appropriate for a desktop machine).
Hope this helps.





Friday, April 5, 2013

How to parse compressed text files using Perl

Large data files, such as FASTQ files from sequencing runs, can reach gigabytes in size, and they can fill up a system's disk easily and quickly. So they are better kept compressed (*.gz or *.zip format) in order to save disk space. You usually save about four fifths of the original (uncompressed) file size by compressing.
Some current bioinformatics tools (such as Picard tools) accept input files in compressed format, but the problem arises when files need to be parsed with custom scripts (such as Perl scripts). One way is to uncompress the file, parse it, and re-compress it, which may take a significant amount of computational time if the data files are large (gigabytes to terabytes in size). Here is a simple way to parse compressed files without uncompressing them:

#!/usr/bin/perl
use strict;
use warnings;

## open a pipe that streams the decompressed data; replace <file_name> with your *.gz file
open FH, "gunzip -c <file_name> |" or die "Cannot open pipe: $!";

while (<FH>) {   ## read a single line from the file
    ## parse the string ##
}

close FH;
If you have *.zip files, you can replace the open command above with:

open FH, "unzip -p <file_name> |" or die "Cannot open pipe: $!";


The only downside of this method is that you cannot seek around in the file through the pipe; you can only read it sequentially from start to end.
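As a side note, a slightly safer variant is the three-argument, list form of open, which avoids passing the file name through the shell (a sketch; reads.fastq.gz is a hypothetical file name):

open(my $fh, '-|', 'gunzip', '-c', 'reads.fastq.gz') or die "Cannot open pipe: $!";
while (my $line = <$fh>) {
    ## parse the line ##
}
close $fh;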

Monday, April 1, 2013

Picard "no space left on device" error

Sometimes Picard's tools terminate with the error message  "no space left on device" even though there is plenty of space available on the disk.

In addition, Picard will throw Java exception messages such as:
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Write error;
.
.
Caused by: java.io.IOException: Input/output error
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at net.sf.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:197)
.
.
etc.


By default, Picard tries to use the system's temporary directory, which has a limited amount of space allocated to it by the system. If you are working with very large data files, the default temporary directory fills up quickly and Picard stops working.
To avoid this error, create a temporary directory somewhere on your system (on a disk with plenty of free space) and direct Picard to use it by specifying TMP_DIR=/some_path/your_temporary_directory/ on the command line.
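For example, with the MarkDuplicates tool the command would look something like the line below (a sketch; the jar location, file names, and the /scratch path are placeholders, and the exact jar name depends on your Picard version):

java -Xmx4g -jar MarkDuplicates.jar INPUT=sample.bam OUTPUT=sample.dedup.bam METRICS_FILE=dup_metrics.txt TMP_DIR=/scratch/your_user/tmp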

Sunday, March 31, 2013

Why are short-read alignments not always reproducible?


Some of you may have noticed that if you re-run a short-read aligner (such as BWA or SOAP) on the same data multiple times, the reported alignments are not identical. Short-read alignment is challenging because of reference genome complexity and the uncertainty of mapping relatively short reads in repetitive regions. So when the aligner finds a read that maps to multiple genomic loci with equal alignment scores, it assigns the read randomly to a single locus. Since this locus is chosen randomly by the aligner, the alignments from multiple runs may not be identical.
To overcome this problem, Bowtie (as well as Bowtie 2) follows a "pseudo-random" selection procedure. The seed for the randomization is generated from the read name (and other attributes from the FASTQ record) for each read. This guarantees exactly the same output every time Bowtie is run. The user may also specify a seed for the randomization, and using the same seed will produce identical alignments from each Bowtie run on the same data.
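For example, in Bowtie 2 the seed can be fixed with the --seed option; something like the following (a sketch; the index name and FASTQ file names are placeholders) should give the same alignments on every run:

bowtie2 --seed 42 -x hg19_index -1 reads_1.fastq -2 reads_2.fastq -S alignments.sam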

Monday, February 4, 2013

How to improve RNA-Seq alignment using TopHat

Hello readers,
TopHat is a program for aligning RNA-Seq reads to the reference genome. TopHat is a spliced-read aligner, i.e. it can align reads spanning canonical splice sites. TopHat's performance and alignment rate (% of mapped reads) depend on the data quality as well as on the alignment stringency. Once the TopHat run is finished, the alignment rate can be further improved by doing a junction-specific TopHat alignment. Afterwards, the accepted_hits.bam files from this two-step alignment can be merged to generate an improved (hopefully) alignment BAM file. This idea was originally proposed in Cabili et al.

TopHat's output contains a file called "junctions.bed" that has all the splice sites detected during the alignment. If your study contains multiple samples, the "junctions.bed" files from the TopHat run of each sample can be combined (or pooled) and used as a bigger, combined junction-site database for a new alignment run. TopHat's output also includes the unmapped reads. In order to improve the overall alignment, these unmapped reads are re-aligned to the reference genome using the merged junction database.

For simplicity, I am dividing the whole process into four steps:

1.) Creating the pooled junction database:
You can do something like:

cat */junctions.bed > junctions.merged.bed

In order to run TopHat with it, the junctions.merged.bed file should be converted to a specific format called the ".junc" format. This conversion can be easily done using another program, "bed_to_juncs", present in the TopHat program directory. The junction file also needs to be sorted, and duplicates must be removed.
You can run the following command to do all of the above steps (BED-to-junction conversion, duplicate junction removal and sorting):

bed_to_juncs < junctions.merged.bed | sort -k 1,4 -u | sort -k 1,1 > junctions.merged.junc

Running this command will give you some warnings similar to:
        Warning: malformed line 1, missing columns
        track name=junctions description="TopHat junctions"
but the output file should be fine. The warning is caused by the presence of header lines in junctions.merged.bed.


2.) Extracting the unmapped reads (or mates):
The unmapped reads from the first TopHat run should be in the TopHat output directory (which is tophat_out by default), either as unmapped.bam (for TopHat 2) or as unmapped_left.fq.gz and unmapped_right.fq.gz (for TopHat 1).

unmapped.bam can be converted to FASTQ or FASTA format using the bamToFastq tool or any other similar utility of your choice; see the example after the next paragraph.
unmapped_left.fq.gz and unmapped_right.fq.gz are compressed FASTQ files that can be decompressed using the gunzip command.
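With bamToFastq from BEDTools, for instance, the conversion could look like this (a sketch; the output file names are placeholders):

bamToFastq -i unmapped.bam -fq unmapped_1.fastq -fq2 unmapped_2.fastq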

3.) Mapping:
Mapping of the unmapped reads from the first alignment can be done by running TopHat again, specifying the unmapped read files (e.g. unmapped_1.fastq and unmapped_2.fastq) as the input files.
You also need to specify the junctions.merged.junc file by including the following in the TopHat command line:

       -j /some_path/junctions.merged.junc --no-novel-juncs
You can keep the rest of the TopHat parameters the same as in the previous TopHat run.

One more thing to keep in mind: for the second TopHat run, create another output directory, otherwise the second run will overwrite the output files of the original run. For convenience, you can create the second run's output directory inside the original TopHat output directory.
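Putting it together, the second run could look something like this (a sketch; the index name hg19_index and the thread count are placeholders, and the remaining options should mirror your first run):

tophat -o tophat_out/second_run -p 8 -j /some_path/junctions.merged.junc --no-novel-juncs hg19_index unmapped_1.fastq unmapped_2.fastq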

4.) Merging:
After step 3, you will have two "accepted_hits.bam" files for each sample. These files can be merged together using:
 samtools merge merged.bam /path_to_tophat_original_run_out/accepted_hits.bam /path_to_tophat_second_run_out/accepted_hits.bam


"merged.bam" is the final alignment file that you can now use for the downstream analysis.


I have observed a 3-10% improvement in the alignment rate by following the above procedure on my datasets. The improvement may vary depending on the data.
Also, junctions.merged.bed may include junctions from other sources, such as reference transcript annotations. The bigger the junction database, the better the alignment is likely to be. But make sure to keep the reference genome sequence consistent.

Note: unmapped_left.fq and unmapped_right.fq may not be properly paired; they may have unequal numbers of reads. You don't need to worry about that while following the above procedure, since TopHat will take care of it during input data processing. If you feel that mate-pair inconsistency is causing problems during the alignment, you can read my previous post, "how-to-fix-order-of-paired-end-reads", to fix this inconsistency.




Monday, January 28, 2013

Re-ordering a BAM file using Picard tools

Some programs, such as RNA-SeQC (a quality control program for NGS data), require BAM (alignment) files to be sorted in the same order as the reference genome file. For example, if the reference genome file that was used to create the alignments had the chromosome order chr1, chr21, chr4, chr5, chr6, ..., then the BAM file should be sorted in that order only.
To do that, Picard tools has a program called "ReorderSam", but this program requires something called a "sequence dictionary" file generated from the reference genome file.
This sequence dictionary can be created using another Picard utility called "CreateSequenceDictionary". The problem I found is that its manual doesn't say exactly what the output file name and format should be.

In order to run "ReorderSam", a reference sequence dictionary file with the extension ".dict" must be present in the same directory as the reference genome sequence file. So if your genome file is "hg19.fa", the dictionary file should be "hg19.dict".

When you run "ReorderSam", give the full path to the reference genome FASTA file (not the dictionary file) as the value of the "REFERENCE" parameter. "ReorderSam" will look for the ".dict" file in the same directory.
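Put together, the two steps would look something like the commands below (a sketch; the jar locations and file names are placeholders, and the exact jar names depend on how your Picard version is packaged):

java -jar CreateSequenceDictionary.jar REFERENCE=hg19.fa OUTPUT=hg19.dict
java -jar ReorderSam.jar INPUT=input.bam OUTPUT=reordered.bam REFERENCE=hg19.fa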