Monday, February 4, 2013

How to improve RNA-Seq alignment using TopHat

Hello readers,
TopHat is the program to align RNA-Sequencing reads to the reference genome. TopHat is a spliced-read aligner that aligns reads spanning the canonical splice-site. TopHat's performance and alignment rate (% of the mapped reads) depends on the data quality as well as the alignment stringency.  Once the TopHat run is finished, the alignment rate can be further improved by doing the junction specific tophat alignment. Later, accepted_hits.bam files from this two-step alignment can be merged to generate an improved (hopefully) alignment bam file. This idea was originally proposed in Cabili et al.

TopHat output contains a file called "junctions.bed" that has all the splice-sites detected during the alignment.  If your study contains multiple samples, "junctions.bed" files from the tophat run for each sample can be combined (or pooled) and can be used as a combined bigger junction site database for the alignment run. TopHat output also includes unmapped reads. In order to improve the overall alignment, these unmapped reads are aligned to the reference genome using the merged junction database.

For simplicity, I am dividing the whole process into four steps:

1.) Creating the pooled junction database:
You can do something like:

cat */junctions.bed > junctions.merged.bed

In order to run tophat, junctions.merged.bed file should be in a specific format called ".junc" format. This conversion can be easily done using another program, bed_to_juncs", present in the tophat program directory. junction file also needs to be sorted and duplicates must be removed.
You can run the following command to do all of the above steps (bed to junction conversion, duplicate junction removal and sorting ):

bed_to_juncs < junctions.merged.bed | sort -k 1,4 -u | sort -k 1,1 > junctions.merged.junc

Running this command will give you some warnings similar to "Warning: malformed line 1, missing columns
        track name=junctions description="TopHat junctions"
but the output file should be fine. This warning is caused due to the presence of some header lines in junctions.merged.bed.


2.) Extracting the unmapped reads (or mates):
Unmapped reads from the first tophat run should be in the tophat output directory (which is tophat_out by default) either as unmapped.bam format (for tophat2) or as unmapped_left.fq.gz and unmapped_right.fq.gz (for tophat 1).

unmapped.bam can be converted to fastq or fasta format using bamToFastQ tool or any other similar utility of your choice.
unmapped_left.fq.gz and unmapped_right.fq.gz are compressed fastq files that can be decompressed using the gunzip command.

3.) Mapping:
Mapping of unmapped reads from the first alican be done by running the tophat again and specifying input files as unmapped_left/1.fastq and unmapped_2/.fastq.
You also need to specify the junctions.merged.junc file by including the following in the tophat command line:

       -j /some_path/junctions.merged.junc --no-novel-junc
You can keep rest of the tophat parameters same as the previous tophat run.

One more thing to keep in mind is that for the second tophat run, make another output directory otherwise second tophat run will replace the output files of the original run. For convenience, you can create the second tophat run output directory within the original tophat output directory.

4.) Merging:
After the step 3, you will have two "accepted_hits.bam" files for each sample. These files can be merged together by using:
 samtools merge merged.bam /path_to_tophat_original_run_out/accepted_hits.bam /path_to_tophat_second_run_out/accepted_hits.bam


"merged.bam" is the final alignment file that you can now use for the downstream analysis.


I have observed a 3 - 10% alignment rate improvement by following the above procedure on my datasets. This improvement may vary according to the data.
Also, junctions.merged.bed may also include junctions from other sources such as reference transcript annotations. Bigger the junction database is, better will be the alignment. But make sure keep the reference genome sequence consistent.

Note: unmapped_left.fq and unmapped_right.fq may not be paired properly. They may have unequal number of reads. You don't need to worry about that while following the above procedure since tophat will take care of that during the input data file processing. If you feel that mate-pair inconsistency is causing some problem during the alignment, you can read my previous blog on "how-to-fix-order-of-paired-end-reads" to fix this inconsistency.