Monday, September 24, 2012

Cuffcompare does not produce p_id

Lately I've been using cufflinks, cuffcompare, cuffdiff and cummeRbund pipeline (Trapnell et al. ) to analyze my RNA-Seq datasets. I am annotating the cufflinks assembled transcripts with the reference human annotated transcripts using cuffcompare. When cuffcompare finds a match between known (annotated) ORF and the assembled transcript, it attaches the attribute "p_id" (abbreviated for protein_id) in the output combined.gtf file. This attribute is required for running cummeRbund successfully in the downstream analysis.
For some reason cuffcompare was generating output files without this attribute when I ran it for my data.  From some discussion on Seqanswers forum, I figured out the cuffcompare needs to be run with "-s" argument in order to get the protein_ids in the output file.
"-s" options directs cuffcompare to look for repeats in the transcripts.
I am not sure if this is a bug or some logical error in the program, but the trick worked for me.


Sunday, September 2, 2012

How to count the splice junction mapping reads

Alignment of reads to the reference genome is usually the first step in the of RNA-Seq analysis of model organisms. Currently available RNA-Seq read alignment programs such as TopHat, STAR and Jaguar generate alignment in BAM format that also contains alignment for reads that span splice-junctions (containing introns as large alignment gaps). Splice-junction reads are important for the transcript assembly and expression as well as discovery of novel splice-junctions. Counting the number of splice-junction reads present in the sequencing library may indicate the overall data quality and transcriptome coverage. For a typical good quality sequencing library of a mammalian transcriptome 20-30% of the reads may correspond to the splice junctions (this is just and observation from the several RNA-Seq papers. True proportion might be much higher).
One simple way to count the number of splice-junction reads is to scan the alignment BAM file and select the count the read-aligments that contain "N" (skipped-bases") in the CIGAR string. These are the alignments that have skipped-bases from the reference genome (most likely introns in case of RNA-Seq reads). One may wish to exclude alignments wit the small deletios (such as 1N or 2N).

I've written a small wrapper script that first uses BamUtil's  findCigar program to extarct the splice-junction alignments and writes them in a BAM file.  The BAM file is sorted and then Picard tools' CollectAlignmentSummaryMetrics program is used to count the number of reads. The sctipt is available on our lab's website.