Monday, February 4, 2013

How to improve RNA-Seq alignment using TopHat

Hello readers,
TopHat is a program for aligning RNA-Seq reads to a reference genome. It is a spliced-read aligner, which means it can align reads that span canonical splice sites. TopHat's performance and alignment rate (the percentage of reads that map) depend on the data quality as well as the alignment stringency. Once the TopHat run is finished, the alignment rate can often be improved further by a second, junction-specific TopHat alignment. The accepted_hits.bam files from this two-step alignment can then be merged to generate an improved (hopefully) alignment BAM file. This idea was originally proposed in Cabili et al.

The TopHat output contains a file called "junctions.bed" that lists all the splice sites detected during the alignment. If your study contains multiple samples, the "junctions.bed" files from each sample's TopHat run can be combined (pooled) and used as a single, larger junction database for the alignment run. The TopHat output also includes the unmapped reads. To improve the overall alignment, these unmapped reads are re-aligned to the reference genome using the merged junction database.

For simplicity, I am dividing the whole process into four steps:

1.) Creating the pooled junction database:
You can do something like:

cat */junctions.bed > junctions.merged.bed

In order to be used by TopHat, the junctions.merged.bed file needs to be converted to TopHat's raw junctions (".juncs") format. This conversion can easily be done with another program, bed_to_juncs, which is included in the TopHat program directory. The junction file also needs to be sorted, and duplicate junctions must be removed.
You can run the following command to do all of the above steps (BED-to-junction conversion, duplicate removal, and sorting):

bed_to_juncs < junctions.merged.bed | sort -k 1,4 -u | sort -k 1,1 > junctions.merged.junc

Running this command will give you some warnings similar to:

        Warning: malformed line 1, missing columns
        track name=junctions description="TopHat junctions"

but the output file should be fine. The warning is caused by the header lines present in junctions.merged.bed.
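If you would rather avoid the warning altogether, you can strip the header lines before the conversion. A minimal sketch, assuming the only header lines are the "track ..." lines:

grep -v "^track" junctions.merged.bed | bed_to_juncs | sort -k 1,4 -u | sort -k 1,1 > junctions.merged.junc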


2.) Extracting the unmapped reads (or mates):
The unmapped reads from the first TopHat run will be in the TopHat output directory (tophat_out by default), either as unmapped.bam (for TopHat2) or as unmapped_left.fq.gz and unmapped_right.fq.gz (for TopHat 1).

unmapped.bam can be converted to fastq or fasta format using the bam2FastQ tool (from BamUtil; see the comments below for a link) or any other similar utility of your choice.
unmapped_left.fq.gz and unmapped_right.fq.gz are compressed fastq files that can be decompressed using the gunzip command.
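For example, something along these lines should work (the bam2FastQ flags below are written from memory, so double-check them against the tool's usage message):

# TopHat2 output: sort unmapped.bam by read name first (this makes the conversion much faster; see the comments below),
# then convert it to fastq with BamUtil's bam2FastQ
samtools sort -n unmapped.bam unmapped.qsorted        # samtools 0.1.x syntax: the last argument is an output prefix
bam bam2FastQ --in unmapped.qsorted.bam --readname

# TopHat 1 output: just decompress the two fastq files
gunzip unmapped_left.fq.gz unmapped_right.fq.gz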

3.) Mapping:
The unmapped reads from the first alignment can be mapped by running TopHat again and specifying the unmapped read files (unmapped_left.fq and unmapped_right.fq, or unmapped_1.fastq and unmapped_2.fastq) as the input.
You also need to supply the junctions.merged.junc file by including the following on the TopHat command line:

       -j /some_path/junctions.merged.junc --no-novel-juncs

You can keep the rest of the TopHat parameters the same as in the previous run.

One more thing to keep in mind: use a different output directory (-o) for the second TopHat run, otherwise it will overwrite the output files of the original run. For convenience, you can create the second run's output directory inside the original TopHat output directory. A sketch of the full command is shown below.
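Putting it together, the second run looks something like this (the Bowtie index path, thread count, and input file names are placeholders; match them to whatever you used in the first run):

tophat -o tophat_out/second_run -p 4 -j /some_path/junctions.merged.junc --no-novel-juncs /path/to/bowtie_index/genome unmapped_left.fq unmapped_right.fq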

4.) Merging:
After step 3, you will have two "accepted_hits.bam" files for each sample. These can be merged together using:
 samtools merge merged.bam /path_to_tophat_original_run_out/accepted_hits.bam /path_to_tophat_second_run_out/accepted_hits.bam


"merged.bam" is the final alignment file that you can now use for the downstream analysis.


I have observed a 3-10% improvement in alignment rate by following the above procedure on my datasets. The improvement will vary with the data.
Also, junctions.merged.bed can include junctions from other sources, such as reference transcript annotations. The bigger the junction database, the better the alignment should be, but make sure to keep the reference genome sequence consistent across all sources.
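For example, if I remember correctly, TopHat ships a small utility called gtf_juncs that converts a GTF annotation into the same raw junctions format, so something along these lines should let you add reference junctions to the pool (check the gtf_juncs usage message, and make sure the GTF matches your genome build):

gtf_juncs genes.gtf >> junctions.merged.junc      # then re-run the sort/dedup step from step 1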

Note: unmapped_left.fq and unmapped_right.fq may not be properly paired; they may contain unequal numbers of reads. You don't need to worry about that while following the above procedure, since TopHat takes care of it during input file processing. If you feel that mate-pair inconsistency is causing a problem during the alignment, you can read my previous post, "how-to-fix-order-of-paired-end-reads", to fix the inconsistency.




27 comments:

  1. Hi Vinay, I am trying this out. For step 2, I have paired-end reads. There is a single unmapped.bam file. I am running bam2fastx with the -o and -P options but running into errors. It generates a fastq without the -P option. Now, if I align reads from this single fastq against the reference and then later merge with "accepted_hits.bam", should I expect the merged.bam to work? Is the pairing messed up in this process?

  2. Hi Harsha,
    Unmapped.bam contains unmapped "pairs" as well as unmapped "mates". You will need to use a program that handles both situations. The one I use (bam2FastQ) is available from:
    http://genome.sph.umich.edu/wiki/BamUtil:_bam2FastQ
    This program can separate paired and unpaired mates.

    You can align the single-end reads against the genome and merge them with the paired-end BAM file, but downstream analysis programs may not work.
    According to the TopHat manual, TopHat doesn't allow mixing paired-end and single-end reads in a single run.
    My recommendation is not to align paired-end reads as single-end reads; you may run into problems during the downstream assembly process.
    Let me know if you need any help with the bam2FastQ program.

  3. Thanks for the BamUtil tool, Vinay. I have installed it and am running it on the first sample's unmapped.bam file. Converting it to paired-end fastq files is taking longer than the original TopHat run itself. Hence, I am thinking that instead of trying to regenerate the fastqs and re-running TopHat, it may be faster and better to re-run TopHat on the original fastq files containing all the reads.

  4. The link for bam2FastQ is very useful, thanks. Picard failed with memory errors and other bam2fasta tools don't cope very well with paired-end data.

  5. Thanks for sharing; bam2FastQ also works well on my data. However, it really takes time. My 4 GB of data (unmapped reads) may take 10 days to finish the fastq extraction. Do you have any suggestion to speed it up? thx!

  6. Miller,
    You need to sort unmapped.bam by queryname. Then run bam2FastQ with the --readname argument.

  7. Hi Vinay, it does work and is very fast! thx!

  8. Thanks for this post. I have a question. I ran Trimmomatic to get rid of Illumina adapter sequences in my RNA-Seq data, and I ended up with two files of paired reads and two of unpaired reads. Since TopHat only handles one kind of read at a time, how would you do the analysis? Run TopHat first on the paired reads and then two more times on the unpaired reads, and combine all of them the way you suggest here?

  9. Hi,
    You are right, TopHat does not accept mixed (paired and unpaired) reads. You can run the paired-end reads first and re-run TopHat on the unmapped reads using the junction database (this step is optional). Merge the two BAM files.
    Now you can merge your unpaired fastq files (a simple cat would do it) and run TopHat on them in single-end mode.
    Then merge the resulting alignment BAM file with the alignment file generated from the paired-end read mapping, and sort the merged BAM file at the end.

    A similar strategy is also mentioned on the TopHat manual page, http://tophat.cbcb.umd.edu/manual.shtml, under the section "Using TopHat".

    In my opinion, if your paired-end data is in the tens of gigabytes and a good percentage of your paired-end reads map to the genome, then I wouldn't worry about the unpaired reads if they amount to only a couple of megabytes.
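    Roughly, that last part would look something like this (the file names here are just placeholders):

    cat unpaired_1.fq unpaired_2.fq > unpaired_all.fq
    tophat -o tophat_out/unpaired_run -j /some_path/junctions.merged.junc --no-novel-juncs /path/to/bowtie_index/genome unpaired_all.fq
    samtools merge all_merged.bam merged.bam tophat_out/unpaired_run/accepted_hits.bam
    samtools sort all_merged.bam all_merged.sorted    # samtools 0.1.x syntax: the last argument is an output prefix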

  10. Hi Vinay, thanks for answering. I am kind of in between, meaning the sample I just ran has about 15 GB in each of the paired-read files, but about 4 GB in one of the unpaired-read files and 65 MB in the other. The concordant pair alignment for that particular sample is 83.4%, so I think I need to include them too.

    And one more question: before I cat the two unpaired-read files, should I add /1 and /2 to the end of each read name in those two files? Aura Perez

  11. I don't think you need to add /1 or /2 tags to the read names, since these reads were left unpaired because their mates were trimmed away completely and will be missing from the rest of the data. So I don't think the read names, by themselves, will create any confusion in the merging process.

  12. OK, only the last sentence is confusing to me, but what I get is that I do not need to do anything to the unpaired files, just cat them and run TopHat? Aura Perez

  13. No need to apologize. I am quite grateful that you took the time to answer me. Aura Perez

  14. Thank you Vinay

    Aura Perez

  15. One more question. When I do the TopHat run for the unpaired fastq files in single-end mode, I assume that I should also use the junction database with this run too?

    Aura Perez

  16. You can do it either way, but since you have already run the paired-end files, you should have covered pretty much all the junction sites in the genome. Running the unpaired reads against the known junctions will also save you some alignment time.

  17. Thank you very much
    Aura Perez

  18. Good morning Vinay. I have another question for you. I am about to begin my Cufflinks run and am not quite sure which library-type option to use, since I will have combined reads in my merged accepted_hits.bam file.

    I have merged 3 BAM files: the original TopHat run with the paired reads, the TopHat run with the pairs extracted from the unmapped reads, and the TopHat run after merging the unpaired read files from Trimmomatic.

    Somebody told me that they usually use the fr-secondstrand option for their Cufflinks analysis, and I am not sure if this applies here. Thanks for any advice.

    Aura Perez

  19. Hi Aura,
    You should use the library type of your paired-end libraries; don't worry about the unpaired reads that you merged in later.
    If you look at the original accepted_hits.bam from your paired-end read mapping with TopHat, you will find lots of reads with missing (unmapped) mates. Cufflinks runs fine on those, so there shouldn't be any problem with your merged BAM files.
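    For example, something like this, with fr-secondstrand only as a placeholder (use whatever type actually matches your paired-end protocol):

    cufflinks -o cufflinks_out --library-type fr-secondstrand merged.bam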

  20. For the final merged BAM file, do I need to sort it using the -n option in samtools, or without it?
    Aura Perez

  21. Sort it on coordinates. It is probably already sorted.

  22. Yes, I did not realize that when you use the merge command in samtools, the output is already sorted. Aura Perez

  23. Hi Vinay,

    What, according to you, is the best way to extract uniquely mapping reads from a TopHat alignment? I had used the -g 1 (max-multihits) option in TopHat, but that is not exactly what I wanted, so I then counted unique reads with a mapping quality of 30 using the following samtools command:

    samtools view -cq 30 uniq_3-R1_TAGCTT_L004_R1_001.bam

    and then counted reads with NH:i:1 using the following command:
    samtools view uniq_3-R1_TAGCTT_L004_R1_001.bam | grep -c NH:i:1

    but I get the exact same number of reads from both commands, which also happens to be the total number of reads in the BAM file. So I am confused as to what the better way is to extract uniquely mapping reads from TopHat BAM files. Usually the samtools mapping-quality option works with bwa etc.



  24. Saad,
    The best way to get the uniquely mapped hits is to filter for the NH:i:1 tag. Since you used -g 1 with TopHat, the number of alignments reported in the BAM file should be the total number of reads, as no secondary alignments are reported.
    Now, since filtering for 'NH:i:1' gives you a number equal to the number of reads in the BAM file, it suggests that you already have only uniquely mapped reads in the BAM file, which seems a little unlikely.
    Can you try to filter for alignments with 'NH:i:2' or more and see if you get anything?
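    For example, something like this should count them (the regex is off the top of my head, so double-check it):

    samtools view uniq_3-R1_TAGCTT_L004_R1_001.bam | grep -c -E 'NH:i:([2-9]|[1-9][0-9]+)'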

  25. Thanks for the suggestion, Vinay.
    Could you tell me if these 4 steps can also be done with TopHat version 2?
    Thanks in advance!

  26. Hi Caio,
    Yes, you can use this strategy with TopHat2. Remember, though, that the unmapped reads will be in 'unmapped.bam', as opposed to unmapped_left.fq.gz and unmapped_right.fq.gz for TopHat(1).

