Sunday, September 2, 2012

How to count the splice junction mapping reads

Alignment of reads to the reference genome is usually the first step in the of RNA-Seq analysis of model organisms. Currently available RNA-Seq read alignment programs such as TopHat, STAR and Jaguar generate alignment in BAM format that also contains alignment for reads that span splice-junctions (containing introns as large alignment gaps). Splice-junction reads are important for the transcript assembly and expression as well as discovery of novel splice-junctions. Counting the number of splice-junction reads present in the sequencing library may indicate the overall data quality and transcriptome coverage. For a typical good quality sequencing library of a mammalian transcriptome 20-30% of the reads may correspond to the splice junctions (this is just and observation from the several RNA-Seq papers. True proportion might be much higher).
One simple way to count the number of splice-junction reads is to scan the alignment BAM file and select the count the read-aligments that contain "N" (skipped-bases") in the CIGAR string. These are the alignments that have skipped-bases from the reference genome (most likely introns in case of RNA-Seq reads). One may wish to exclude alignments wit the small deletios (such as 1N or 2N).

I've written a small wrapper script that first uses BamUtil's  findCigar program to extarct the splice-junction alignments and writes them in a BAM file.  The BAM file is sorted and then Picard tools' CollectAlignmentSummaryMetrics program is used to count the number of reads. The sctipt is available on our lab's website.




 

1 comment:

  1. This is really useful. Thank you for the information regarding Splice junction

    ReplyDelete

Comment moderation has been enabled. All comments must be approved by the blog author. Please type your comment below and hit 'Publish'.