Sunday, March 31, 2013

Why are short-read alignments not always reproducible?


Some of you may have noticed that if you re-run a short-read aligner (such as BWA or SOAP) on the same data multiple times, the reported alignments are not always identical. Short-read alignment is challenging because of the complexity of the reference genome and the uncertainty in placing relatively short reads in repetitive regions. When the aligner finds that a read maps to multiple genomic loci with equal alignment scores, it assigns the read to one of those loci at random. Because this locus is chosen randomly, alignments from multiple runs may differ.
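To make the tie-breaking step concrete, here is a minimal Python sketch. It is not taken from any aligner's source code; the function name pick_locus_randomly and the (chromosome, position, score) layout of a hit are assumptions made up for illustration. It only shows why an unseeded random choice among equally good loci can differ from run to run.

```python
import random

def pick_locus_randomly(hits, rng=None):
    """Return one best-scoring hit, chosen at random among ties.

    hits: list of (chromosome, position, score) tuples (hypothetical layout).
    rng:  a random.Random instance; a fresh, unseeded one is used by default.
    """
    if rng is None:
        rng = random.Random()  # seeded from OS entropy, so each run can differ
    best = max(score for _, _, score in hits)
    tied = [hit for hit in hits if hit[2] == best]
    return rng.choice(tied)

# Hypothetical read that maps equally well to two repeat copies.
hits = [("chr1", 10500, 60), ("chr7", 88210, 60), ("chr12", 4031, 42)]
print(pick_locus_randomly(hits))
```

Running this script several times will sometimes report the chr1 locus and sometimes the chr7 locus, which is the behaviour described above.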
To overcome this problem, bowtie (as well as bowtie2) follows a "pseudo-random" selection procedure. The seed for the randomization is generated for each read from the read name (and other attributes from the FASTQ record). This guarantees exactly the same output every time bowtie is run on the same data. The user may also specify a seed for the randomization; using the same seed will produce identical alignments from each bowtie run on the same data.
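The idea of seeding from the read itself can be sketched in a few lines of Python. This is only an illustration of the principle, not bowtie's actual implementation: the SHA-256 hash, the function names seed_for_read and pick_locus_reproducibly, and the (chromosome, position, score) hit layout are all assumptions chosen for the example.

```python
import hashlib
import random

def seed_for_read(name, sequence, qualities, user_seed=0):
    """Derive an integer seed from the FASTQ fields of one read.

    The hashing scheme is a stand-in chosen for this sketch.
    """
    payload = "\n".join([name, sequence, qualities, str(user_seed)])
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def pick_locus_reproducibly(hits, name, sequence, qualities, user_seed=0):
    """Choose among equally scoring hits using a per-read seeded generator."""
    rng = random.Random(seed_for_read(name, sequence, qualities, user_seed))
    best = max(score for _, _, score in hits)
    # Sort the ties so the choice does not depend on the order hits were found.
    tied = sorted(hit for hit in hits if hit[2] == best)
    return rng.choice(tied)

# Hypothetical read and hits, for illustration only.
hits = [("chr1", 10500, 60), ("chr7", 88210, 60), ("chr12", 4031, 42)]
read = ("read_001", "ACGTACGTAC", "IIIIIIIIII")
print(pick_locus_reproducibly(hits, *read))
```

Because the generator is re-seeded from the read's own fields, the same read gets the same locus on every run, regardless of what other reads are in the file. Changing user_seed may give a different, but again repeatable, assignment.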