Bioinformatics blog: February 2012

Tuesday, February 28, 2012

How to estimate insert size for paired-end RNA-Seq data

Reference genome alignment high-throughput RNA-seq reads generated from "paired-end" library often requires "mate-pair insert length" and associated standard deviation for alignment tools such as BowTie, TopHat or BWA to work properly. These values can be obtained from the library prep that was used for the sequencing. But these values are not always accompanied with the data in public databases hence need to be estimated computationally. Here are some simple steps:

Required sowftare: BowTie or BWA, Picard tools.
Reference transcriptome: Reference transcript sequences from the same species as the RNA-Seq data is from. Instead of reference transcript, reference genome can also be used. I prefer reference transcriptome since it saves a magnitude of alignment as well as downstream processing time over the whole genome and also generates more alignments as splice junction reads are easily aligned to the transcript sequence than the reference genome sequence.

Steps:
1.)Create reference transcriptome index for BowTie (or BWA).
2.)Align paired-end RNA-Seq data to the reference transcripts using minimum insert-length as zero and maximum insert-length as 500 (or more).
BowTie: -I 0 -X 500, BWA: -a 500. Set output format as SAM(BowTie) or BAM(BWA).
3.) Sort resulting SAM/BAM file using Picard's SortSam.jar utility as follows:
java -Xmx2g -jar picard-tools/SortSam.jar INPUT= OUTPUT= SORT_ORDER=coordinate
4.) Now run Picard's CollectInsertSizeMetrics.jar utility as follows:
java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT= HISTOGRAM_FILE= OUTPUT=
5.)Insert-size distribution is printed in output_file while corresponding histogram image file is generated as pdf in histogram_file.
6.)Insert-matrix output_file conain estimated insert-size and corresponding standard deviation. Picard generates two types estimates: "MEDIAN_INSERT_SIZE" and "MEAN_INSERT_SIZE". Anyoneone of them can be used but I prefer the second one since it is estmated after trimming off the outliers in from the original insert-size distribution in order to render normal distribution.

Note: If you plan to run TopHat on the same data aftrwards, set -r as (estimated insert-size-2*read_length) since TopHat requires distance between mate-pairs and not the insert-size.

Monday, February 27, 2012

How to extract paired-end reads from SRA files

SRA(NCBI) stores all the sequencing run as single "sra" or "lite.sra" file. You may want separate files if you want to use the data from paired-end sequencing. When I run SRA toolkit's "fastq-dump" utility on paired-end sequencing SRA files, sometimes I get only one files where all the mate-pairs are stored in one file rather than two or three files.
The solution for the problem is to always run fastq-dump with "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffix "_1" and "_2") are matched mate-pair read file where as the third one (without any suffix) contains all the reads that do not have any mate-paires (or SRA couldn't resolve mate-paires for them).

Hope my experiences with NCBI SRA data handling help the readership.

Thursday, February 23, 2012

Update in sra toolkit

I just started working on TCGA high-throughput RNA-Seq data from SRA (short read archive). I was trying to convert lite.sra files to fastq format usign sra-toolkit's fastq-dump utility that I downloaded couple of months ago. It seems the older binary files are not working properly on the data from TCGA.
Anyone who is also facing the same problem, please update your sra toolkit binary files to version 2.1.9 that works very well with the TCGA data.

Thursday, February 2, 2012

RNA-seq pipeline

Hello readers,
I've been working on a computational pipeline to analyze RNA-Seq data in order to detect alternatice splicing and chimeric transcripts. It got published last month in Nucleic Acid Research. Please find the paper at:
http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks047.long

The program is freely available at: www.mcdonaldlab.biology.gatech.edu/r-sap.htm.
The pipeline is actively under development so feel free to use it and any suggestions and feedbacks are most welcome.

How to visualize transcript alignments

Sorry readers,
Was very busy with my work so couldn't post any updates.
Lately I've been working with human cancer transcriptome assemblies. I mainly wanted to find out the changes in 3'end of the assembled transcripts as compared with the "reference" (normal transcripts). I aligned my short reads to the Cufflinks assembled transcripts using BWA.
Then the question is how do I visualize the alignments. There are plenty of visualizer to visualize reference genome alignment but couldn't find anything for transcripts (may be didn't search hard enough for it).
Then I though of using the sequences of assembled transcripts as reference only and imported them to IGV (Integrated Genome Viewer) and also imported my BWA alignments in BAM format. It works and I can see where poly-A tails are presens (representing the 3' end of the transcript).
The only glitch is that I can see only one assembled transcript at a time. Anyways.. small trick.. but solved my purpose.

Bioinformatics blog