Monday, October 31, 2011

Human Chimpanzee comparative genomic analysis

Last week, our lab published a research article (I am a co-author on the paper) that associates phenotypic differences between humans and chimpanzees with genomic differences, especially in the so-called "junk DNA". It is an interesting article and got good media coverage from various news sites, including ScienceDaily. The article is published in Mobile DNA and is freely accessible. A PubMed entry for the article is also available.

Wednesday, October 19, 2011

BowTie2 (gapped aligner for RNA-Seq reads)

I always wondered why short-read aligners were not increasing the length of alignable reads, given that sequencing technologies are pushing harder day by day to improve read lengths. The current ultra-high-throughput platform is TruSeq (from Illumina), which can generate paired reads of 100 x 100 bp.
BowTie has been the traditional short-read aligner since it was published in Genome Biology in 2009, but it was unable to perform gapped alignments of long reads. Recently, on October 16, 2011, a new version of BowTie (BowTie2) was released that can perform gapped alignments. I haven't started using it yet, but I will soon. Hopefully it will alleviate the slow speed of long-read alignment with tools such as BLAT.
BowTie has been the core alignment tool for several popular RNA-Seq pipelines such as Trans-ABySS, TopHat, etc. Does the release of the new BowTie version affect those pipelines as well?
I will keep you posted if I find anything interesting related to BowTie2.
Thanks for reading.

Monday, October 10, 2011

How to tweak the BLAT code

Lately I've been working on long RNA-Seq reads (454 reads) and have been using BLAT (BLAST-like alignment tool) by Jim Kent to map sequencing reads to the reference genome. I particularly like BLAT for this purpose because it stitches together alignment blocks scattered over a span (mainly separated by putative introns) and outputs them as a single "gene-oriented" alignment.
I've been trying to run BLAT on the whole genome at once (all chromosome sequences in one file) rather than running it on each chromosome.
If I run each chromosome separately, at the end of the run I have to merge the BLAT output files from the different chromosomes and sort them on the read ID, as I would like to have all the possible genomic hits for each read together in the file.
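
If the per-chromosome route is used anyway, the merge-and-sort step can be sketched in a few lines of Python. This is only a minimal sketch, assuming the default tab-separated PSL output of BLAT, where the read ID (qName) is the 10th column. The function name merge_psl_by_read and the file names in the note below are made up for illustration.

```python
def merge_psl_by_read(psl_sources):
    """Merge PSL lines from several BLAT runs and sort them by read ID (qName).

    psl_sources: an iterable of line iterables, e.g. open file handles
    or lists of strings (one per chromosome run).
    """
    records = []
    for source in psl_sources:
        for line in source:
            fields = line.rstrip("\n").split("\t")
            # Alignment rows have 21 tab-separated columns and start with
            # the numeric match count; this skips any PSL header lines.
            if len(fields) >= 21 and fields[0].isdigit():
                records.append(fields)
    # qName is column 10 (index 9). list.sort() is stable, so hits for the
    # same read keep the order in which the input files were given.
    records.sort(key=lambda fields: fields[9])
    return ["\t".join(fields) for fields in records]
```

For example, merge_psl_by_read(open(p) for p in ["chr1.psl", "chr2.psl"]) would return all hits grouped by read (the file names are hypothetical). Everything is held in memory here, so for very large outputs an on-disk sort may be a better fit.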

Every time I try to run BLAT like that, it fails (especially for the human genome) and terminates with the error "needHugeMem: Out of huge memory - request size 957189248 bytes". I browsed through the help pages on the UCSC genome website (the host and support website for the BLAT program) but didn't find anything conclusive. Finally, I started to look at the source code of BLAT itself. Here is the trick/tweak to get around the memory allocation problem.

1. Download the source code of BLAT.
2. Unzip the file and enter the directory.
3. Now open the file lib/memalloc.c
4. Go to line number 76 where "static size_t maxAlloc" is defined.
5. Change 128*8*1024*1024*(sizeof(size_t)/4)*(sizeof(size_t)/4) to
(128*8*1024*1024*(sizeof(size_t)/4)*(sizeof(size_t)/4))*2
6. Save and close the file.
7. Now follow the instructions given in the BLAT README to compile the edited code and create the binaries.

This will make BLAT able to load large genomes at once.
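
To see what the edit in step 5 does to the limit, the arithmetic can be reproduced in Python. This is a rough sketch that only mirrors the expression quoted in step 5; it does not model how BLAT actually allocates memory. The helper name max_alloc is made up for illustration, and ctypes is used only to look up the size of size_t on the current machine.

```python
import ctypes


def max_alloc(size_t_bytes, factor=1):
    """Evaluate the maxAlloc expression from step 5 for a given sizeof(size_t).

    factor=1 gives the original value, factor=2 the value after the tweak.
    """
    scale = size_t_bytes // 4  # integer division, as in the C expression
    return 128 * 8 * 1024 * 1024 * scale * scale * factor


if __name__ == "__main__":
    word = ctypes.sizeof(ctypes.c_size_t)  # 4 on 32-bit builds, 8 on 64-bit
    print("sizeof(size_t):", word)
    print("original limit:", max_alloc(word), "bytes")
    print("tweaked limit: ", max_alloc(word, 2), "bytes")
```

With a 4-byte size_t the expression gives 1073741824 bytes (1 GiB) before the change and 2147483648 bytes after it; with an 8-byte size_t it gives 4294967296 and 8589934592 bytes.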