Monday, November 26, 2012

How to parse a BAM file using Perl

Hello readers,
I have been working with the analysis of junction reads using bam files generated by RNA-Seq read reference genome alignment. I had to write my own customized perl scripts to perform my analysis. Since BAM files are binary files and not simple ASCII text files, these can not be read by perl's default IO directly. Usually people use BioPerl modules to access the BAM files but here is an easier way to do it in perl:


#!/usr/bin/perl
use strict;
use warnings;

open BAM,"samtools view <bam_file> |";

        while(<BAM>){
        next if(/^(\@)/);  ## skipping the header lines (if you used -h in the samools command)
        s/\n//;  s/\r//;  ## removing new line
        my @sam = split(/\t+/);  ## splitting SAM line into array

        ## now use @sam as regular array
       }

Note:
You need to have samtools installed on your system.
Samtools will output alignments as SAM format that follows 1-based coordinate system. So parse the alignment data accordingly.


       
               



Monday, September 24, 2012

Cuffcompare does not produce p_id

Lately I've been using cufflinks, cuffcompare, cuffdiff and cummeRbund pipeline (Trapnell et al. ) to analyze my RNA-Seq datasets. I am annotating the cufflinks assembled transcripts with the reference human annotated transcripts using cuffcompare. When cuffcompare finds a match between known (annotated) ORF and the assembled transcript, it attaches the attribute "p_id" (abbreviated for protein_id) in the output combined.gtf file. This attribute is required for running cummeRbund successfully in the downstream analysis.
For some reason cuffcompare was generating output files without this attribute when I ran it for my data.  From some discussion on Seqanswers forum, I figured out the cuffcompare needs to be run with "-s" argument in order to get the protein_ids in the output file.
"-s" options directs cuffcompare to look for repeats in the transcripts.
I am not sure if this is a bug or some logical error in the program, but the trick worked for me.


Sunday, September 2, 2012

How to count the splice junction mapping reads

Alignment of reads to the reference genome is usually the first step in the of RNA-Seq analysis of model organisms. Currently available RNA-Seq read alignment programs such as TopHat, STAR and Jaguar generate alignment in BAM format that also contains alignment for reads that span splice-junctions (containing introns as large alignment gaps). Splice-junction reads are important for the transcript assembly and expression as well as discovery of novel splice-junctions. Counting the number of splice-junction reads present in the sequencing library may indicate the overall data quality and transcriptome coverage. For a typical good quality sequencing library of a mammalian transcriptome 20-30% of the reads may correspond to the splice junctions (this is just and observation from the several RNA-Seq papers. True proportion might be much higher).
One simple way to count the number of splice-junction reads is to scan the alignment BAM file and select the count the read-aligments that contain "N" (skipped-bases") in the CIGAR string. These are the alignments that have skipped-bases from the reference genome (most likely introns in case of RNA-Seq reads). One may wish to exclude alignments wit the small deletios (such as 1N or 2N).

I've written a small wrapper script that first uses BamUtil's  findCigar program to extarct the splice-junction alignments and writes them in a BAM file.  The BAM file is sorted and then Picard tools' CollectAlignmentSummaryMetrics program is used to count the number of reads. The sctipt is available on our lab's website.




 

Friday, August 24, 2012

Computational bottleneck in fusion-transcript detection using RNA-Seq data

Hello readers,
For past couple of months I have been working on de-novo assemblies of  human transcriptome (RNA-Seq ) datasets.  The goal is to detect the chimeric transcripts that requires alignment of the assembled contigs against the reference genome. This alignment is usually done by BLAT  and for a typical mammalian transcriptome assembly, may take up to several weeks to finish.
Long assembled contigs, sluggish performance of BLAT and lack of parallelism in BLAT can can create a computational bottleneck if one is dealing with multiple samples.
One simple way of easing this problem is to reduce the initial number of contigs to align by selecting only the potential chimeric transcripts for the reference genome alignment using BLAT.
This can be easily done by first aligning the assembled contigs to reference genome annnotations such as Gencode or Ensembl reference transcript sequences using MegaBlast. Since MegaBlast can be run on multiple threads and alignment is being done against the whole transcriptome rather than the whole genome, this step should not take more than 5-6 hours on a typical server.
In the Blast output file, query sequences (assembled contigs in this case) representing potential chimeric transcripts will not produce alignment with good query coverage in single alignment hits. I usually select those contigs as potential chimeric transcripts that have <90% of the contig length covered in the alignment and also leaving at least 25 bases unaligned on either or both ends (alignmentStart > 25 and/or (queryLength - alignmentEnd < 25)).
This pre-genomic-alignment filtering will reduce the query data file to 20% of the original file hence the alignment time.

This alignment may be further speed-up if simple and low-complexity end-bases are trimmed from the assembled contigs. It can be done easily using Dustmasker or WindowsMasker. This trimming may also help reducing fasle-positives in chimeric transcript detection.

Friday, July 27, 2012

Tophat-Fusion and Picard SortSam

I ran TopHat2 on my RNA-Seq paired-end reads with "fusion-search" on in order to detect some fusion events. The resulting output file, "accepted_hits.bam", is coordinate sorted by default. I tried to sort  it by queryname using Picard tools' SortSam.jar. In doing so, SortSam.jar terminates with an error about chromosome coordinate mistatch between mates. After pondering over the error for a moment, I realized that SortSam expects both of mates to be aligned to the same chromosome and throws an error instance in case of any inconsistency which is in-fact the case with the mates spanning a fusion event.
The solution is simple: Initiate the SortSam.jar with VALIDATION_STRINGENCY=SILENT argument.
An alternative is to use the samtools sort but I prefer Picard's SortSam as it is much faster than the samtools sort.

Thursday, May 31, 2012

How to fix the order of paired-end reads

Sometimes paired end reads stored in two fastq files (each corresponding to left and right end reads, respectively), do not appear in the same order in both files. I encountered this problem with the RNA-Seq data sets from ENCODE. In theory, both reads from one insert should appear in the same order. This order is important for downstream alignment tools such as BWA.
To fix the order I have written some scripts that use Picards tools. These scripts can be found on our lab website.
Basic idea is to convert each fastq file to SAM file (with reads marked as unmapped) individually using FastqToSam program. Then merge both SAM files in to one using MergeSamFiles and sort the merged file on read name using SortSam. Once the file is sorted, reads can be extracted easily from the SAM file. For this purpose I had to write a perl script (also available on the link above) since the merged SAM file will be missing proper header and conventional programs such as SamToFastq and BamToFastq won't work here.
Before proceeding with the scripts, please go through the instructions.

Thursday, April 5, 2012

Human genome annotation resources

For all the the folks who are doing RNA-Seq analysis and looking for a comprehensive annotation (esp. for the human genome), I found a decent table in the Supplementary materials of the paper by Cabili et al. (Integrative Annotation of Human Large Intergenic Non-Coding RNAs Reveals Global Properties and Specific Subclasses, Genes Dev., Sept 2, 2011) pubmed. These annotations prove to be very useful during the filtering of transcripts after the transcriptome assembly. rRNA and tRNA gtf file can be supplied to Cufflinks (using -M option) during the assembly process in order to exclude reads mapping to these RNAs. Exclusion of such reads can improve abundance estimates of other more useful (protein coding, lincRNAs etc) transcripts.
I am pasting the table below:

Tuesday, February 28, 2012

How to estimate insert size for paired-end RNA-Seq data

Reference genome alignment high-throughput RNA-seq reads generated from "paired-end" library often requires "mate-pair insert length" and associated standard deviation for alignment tools such as BowTie, TopHat or BWA to work properly. These values can be obtained from the library prep that was used for the sequencing. But these values are not always accompanied with the data in public databases hence need to be estimated computationally. Here are some simple steps:

Required sowftare: BowTie or BWA, Picard tools.
Reference transcriptome: Reference transcript sequences from the same species as the RNA-Seq data is from. Instead of reference transcript, reference genome can also be used. I prefer reference transcriptome since it saves a magnitude of alignment as well as downstream processing time over the whole genome and also generates more alignments as splice junction reads are easily aligned to the transcript sequence than the reference genome sequence.

Steps:
1.)Create reference transcriptome index for BowTie (or BWA).
2.)Align paired-end RNA-Seq data to the reference transcripts using minimum insert-length as zero and maximum insert-length as 500 (or more).
BowTie: -I 0 -X 500, BWA: -a 500. Set output format as SAM(BowTie) or BAM(BWA).
3.) Sort resulting SAM/BAM file using Picard's SortSam.jar utility as follows:
java -Xmx2g -jar picard-tools/SortSam.jar INPUT= OUTPUT= SORT_ORDER=coordinate
4.) Now run Picard's CollectInsertSizeMetrics.jar utility as follows:
java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT= HISTOGRAM_FILE= OUTPUT=
5.)Insert-size distribution is printed in output_file while corresponding histogram image file is generated as pdf in histogram_file.
6.)Insert-matrix output_file conain estimated insert-size and corresponding standard deviation. Picard generates two types estimates: "MEDIAN_INSERT_SIZE" and "MEAN_INSERT_SIZE". Anyoneone of them can be used but I prefer the second one since it is estmated after trimming off the outliers in from the original insert-size distribution in order to render normal distribution.

Note: If you plan to run TopHat on the same data aftrwards, set -r as (estimated insert-size-2*read_length) since TopHat requires distance between mate-pairs and not the insert-size.

Monday, February 27, 2012

How to extract paired-end reads from SRA files

SRA(NCBI) stores all the sequencing run as single "sra" or "lite.sra" file. You may want separate files if you want to use the data from paired-end sequencing. When I run SRA toolkit's "fastq-dump" utility on paired-end sequencing SRA files, sometimes I get only one files where all the mate-pairs are stored in one file rather than two or three files.
The solution for the problem is to always run fastq-dump with "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffix "_1" and "_2") are matched mate-pair read file where as the third one (without any suffix) contains all the reads that do not have any mate-paires (or SRA couldn't resolve mate-paires for them).

Hope my experiences with NCBI SRA data handling help the readership.

Thursday, February 23, 2012

Update in sra toolkit

I just started working on TCGA high-throughput RNA-Seq data from SRA (short read archive). I was trying to convert lite.sra files to fastq format usign sra-toolkit's fastq-dump utility that I downloaded couple of months ago. It seems the older binary files are not working properly on the data from TCGA.
Anyone who is also facing the same problem, please update your sra toolkit binary files to version 2.1.9 that works very well with the TCGA data.

Thursday, February 2, 2012

RNA-seq pipeline

Hello readers,
I've been working on a computational pipeline to analyze RNA-Seq data in order to detect alternatice splicing and chimeric transcripts. It got published last month in Nucleic Acid Research. Please find the paper at:
http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks047.long

The program is freely available at: www.mcdonaldlab.biology.gatech.edu/r-sap.htm.
The pipeline is actively under development so feel free to use it and any suggestions and feedbacks are most welcome.

How to visualize transcript alignments

Sorry readers,
Was very busy with my work so couldn't post any updates.
Lately I've been working with human cancer transcriptome assemblies. I mainly wanted to find out the changes in 3'end of the assembled transcripts as compared with the "reference" (normal transcripts). I aligned my short reads to the Cufflinks assembled transcripts using BWA.
Then the question is how do I visualize the alignments. There are plenty of visualizer to visualize reference genome alignment but couldn't find anything for transcripts (may be didn't search hard enough for it).
Then I though of using the sequences of assembled transcripts as reference only and imported them to IGV (Integrated Genome Viewer) and also imported my BWA alignments in BAM format. It works and I can see where poly-A tails are presens (representing the 3' end of the transcript).
The only glitch is that I can see only one assembled transcript at a time. Anyways.. small trick.. but solved my purpose.