Bioinformatics blog

Thursday, May 31, 2012

How to fix the order of paired-end reads

Sometimes paired end reads stored in two fastq files (each corresponding to left and right end reads, respectively), do not appear in the same order in both files. I encountered this problem with the RNA-Seq data sets from ENCODE. In theory, both reads from one insert should appear in the same order. This order is important for downstream alignment tools such as BWA.
To fix the order I have written some scripts that use Picards tools. These scripts can be found on our lab website.
Basic idea is to convert each fastq file to SAM file (with reads marked as unmapped) individually using FastqToSam program. Then merge both SAM files in to one using MergeSamFiles and sort the merged file on read name using SortSam. Once the file is sorted, reads can be extracted easily from the SAM file. For this purpose I had to write a perl script (also available on the link above) since the merged SAM file will be missing proper header and conventional programs such as SamToFastq and BamToFastq won't work here.
Before proceeding with the scripts, please go through the instructions.

Thursday, April 5, 2012

Human genome annotation resources

For all the the folks who are doing RNA-Seq analysis and looking for a comprehensive annotation (esp. for the human genome), I found a decent table in the Supplementary materials of the paper by Cabili et al. (Integrative Annotation of Human Large Intergenic Non-Coding RNAs Reveals Global Properties and Specific Subclasses, Genes Dev., Sept 2, 2011) pubmed. These annotations prove to be very useful during the filtering of transcripts after the transcriptome assembly. rRNA and tRNA gtf file can be supplied to Cufflinks (using -M option) during the assembly process in order to exclude reads mapping to these RNAs. Exclusion of such reads can improve abundance estimates of other more useful (protein coding, lincRNAs etc) transcripts.
I am pasting the table below:

Tuesday, February 28, 2012

How to estimate insert size for paired-end RNA-Seq data

Reference genome alignment high-throughput RNA-seq reads generated from "paired-end" library often requires "mate-pair insert length" and associated standard deviation for alignment tools such as BowTie, TopHat or BWA to work properly. These values can be obtained from the library prep that was used for the sequencing. But these values are not always accompanied with the data in public databases hence need to be estimated computationally. Here are some simple steps:

Required sowftare: BowTie or BWA, Picard tools.
Reference transcriptome: Reference transcript sequences from the same species as the RNA-Seq data is from. Instead of reference transcript, reference genome can also be used. I prefer reference transcriptome since it saves a magnitude of alignment as well as downstream processing time over the whole genome and also generates more alignments as splice junction reads are easily aligned to the transcript sequence than the reference genome sequence.

Steps:
1.)Create reference transcriptome index for BowTie (or BWA).
2.)Align paired-end RNA-Seq data to the reference transcripts using minimum insert-length as zero and maximum insert-length as 500 (or more).
BowTie: -I 0 -X 500, BWA: -a 500. Set output format as SAM(BowTie) or BAM(BWA).
3.) Sort resulting SAM/BAM file using Picard's SortSam.jar utility as follows:
java -Xmx2g -jar picard-tools/SortSam.jar INPUT= OUTPUT= SORT_ORDER=coordinate
4.) Now run Picard's CollectInsertSizeMetrics.jar utility as follows:
java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT= HISTOGRAM_FILE= OUTPUT=
5.)Insert-size distribution is printed in output_file while corresponding histogram image file is generated as pdf in histogram_file.
6.)Insert-matrix output_file conain estimated insert-size and corresponding standard deviation. Picard generates two types estimates: "MEDIAN_INSERT_SIZE" and "MEAN_INSERT_SIZE". Anyoneone of them can be used but I prefer the second one since it is estmated after trimming off the outliers in from the original insert-size distribution in order to render normal distribution.

Note: If you plan to run TopHat on the same data aftrwards, set -r as (estimated insert-size-2*read_length) since TopHat requires distance between mate-pairs and not the insert-size.

Monday, February 27, 2012

How to extract paired-end reads from SRA files

SRA(NCBI) stores all the sequencing run as single "sra" or "lite.sra" file. You may want separate files if you want to use the data from paired-end sequencing. When I run SRA toolkit's "fastq-dump" utility on paired-end sequencing SRA files, sometimes I get only one files where all the mate-pairs are stored in one file rather than two or three files.
The solution for the problem is to always run fastq-dump with "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffix "_1" and "_2") are matched mate-pair read file where as the third one (without any suffix) contains all the reads that do not have any mate-paires (or SRA couldn't resolve mate-paires for them).

Hope my experiences with NCBI SRA data handling help the readership.

Thursday, February 23, 2012

Update in sra toolkit

I just started working on TCGA high-throughput RNA-Seq data from SRA (short read archive). I was trying to convert lite.sra files to fastq format usign sra-toolkit's fastq-dump utility that I downloaded couple of months ago. It seems the older binary files are not working properly on the data from TCGA.
Anyone who is also facing the same problem, please update your sra toolkit binary files to version 2.1.9 that works very well with the TCGA data.

Thursday, February 2, 2012

RNA-seq pipeline

Hello readers,
I've been working on a computational pipeline to analyze RNA-Seq data in order to detect alternatice splicing and chimeric transcripts. It got published last month in Nucleic Acid Research. Please find the paper at:
http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks047.long

The program is freely available at: www.mcdonaldlab.biology.gatech.edu/r-sap.htm.
The pipeline is actively under development so feel free to use it and any suggestions and feedbacks are most welcome.

How to visualize transcript alignments

Sorry readers,
Was very busy with my work so couldn't post any updates.
Lately I've been working with human cancer transcriptome assemblies. I mainly wanted to find out the changes in 3'end of the assembled transcripts as compared with the "reference" (normal transcripts). I aligned my short reads to the Cufflinks assembled transcripts using BWA.
Then the question is how do I visualize the alignments. There are plenty of visualizer to visualize reference genome alignment but couldn't find anything for transcripts (may be didn't search hard enough for it).
Then I though of using the sequences of assembled transcripts as reference only and imported them to IGV (Integrated Genome Viewer) and also imported my BWA alignments in BAM format. It works and I can see where poly-A tails are presens (representing the 3' end of the transcript).
The only glitch is that I can see only one assembled transcript at a time. Anyways.. small trick.. but solved my purpose.

Monday, October 31, 2011

Human Chimpanzee comparative genomic analysis

Last week, our lab published a research article (me as a co-author on the paper) that associates phenotypic differences between human and chimpanzees with the genomic differences, especially in the "so-called" or supposedly "Junk-DNA". Interesting article and got a good media coverage from various news sites including ScienceDaily. Article is published in Mobile DNA and freely accessible. PubMed entry of the article.

Wednesday, October 19, 2011

BowTie2 (gapped aligner for RNA-Seq reads)

I always wondered why short read aligners were not increasing the length of alignable reads given the fact that sequencing technologies are pushing harder day by day to improve the read lengths. Current ultra high-througput machine is TruSeq (from Illumina) that can generate read lengths of 100X100 bp (paired).
BowTie has been traditional short read aligner since it was published in GenomeBiology in 2009 but was unable to performs gapped alignments of long reads. Recently, on October 16, 2011, new version of BowTie (BowTie2) was released that can perform gapped alignments. I haven't started using it but will soon will. Hopefully it will alleviate the slow speed problem in long read alignment such as using BLAT.
BowTie has been the core alignment tool for several popular RNA-Seq pipelines such as Trans-ABySS, TopHat etc. Does the release of new BowTie version affect those pipelines as well?
Will keep you posted if I find anything interesting related to BowTie2.
Thanks for reading.

Monday, October 10, 2011

How to tweak the BLAT code

Lately I've been working on long RNA-Seq reads (454 reads) and have been using BLAT (Blast like alignment tool) by Jim Kent in order to map sequencing reads to the reference genome. I like BLAT for this purpose particularly as it stitches alignment blocks scattered over a span (separated by putative introns mainly) and outputs them as a single "gene-oriented" alignment.

I've been trying to run the BLAT for the whole genome at once (all chromosome sequences in one file) rather than running for each chromosome.

If I run each chromosome separately, at the end of the run I'll have to merge BLAT output files from different chromosomes and sort them on the read ID as I would like to have all the possible genomic hits for each read all together in the file.

Everytime I try to run BLAT like that, it fails to do that (especially for human genome) and terminates with the error "needHugeMen: Out of huge memory - request size 957189248 bytes". I I browsed through the help pages on UCSC genome website (host and support website for the BLAT program) but didn't find anything conclusive. Finally, I started to look at the source code of BLAT itself. Here is the trick/tweak to get around the memory allocation problem.

1. Download the source code of BLAT.

2. Unzip the file and enter the directory.

3. Now open the file lib/memalloc.c

4. Go to line number 76 where "static size_t maxAlloc" is defined.

5. change 128*8*1024*1024*(sizeof(size_t)/4)*(sizeof(size_t)/4) to

(128*8*1024*1024*(sizeof(size_t)/4)*(sizeof(size_t)/4))*2

6. Save and close the file.

7. Now follow the instructions given in BLAT README to compile the edited code and to create the binary files.

This will make the BLAT able to load large genomes at once.

Thursday, September 29, 2011

My story

Dear readers,

I am a graduate student of bioinformatics and keenly interested in high-throughput sequencing technologies and data analysis. My active area of research is the applications of high-throughput transcriptome and genome sequencing for the cancer research.

I like to play around with the variety of functional genomics data and available tools to analyze them, develop integrated analysis tool to maximize the relevant information.

During my research, everyday I come across several trivial but, to my surprise, tricky problems. Sometimes I am able to solve them in a straightforward manner and sometimes with little tweaks and manipulations in the traditional methods. I would like to share those problems and issue with the readership....I'll keep posting them.

Will be back soon.... enjoy till then.