Thursday, May 31, 2012

How to fix the order of paired-end reads

Sometimes paired-end reads stored in two FASTQ files (one holding the left-end reads, the other the right-end reads) do not appear in the same order in both files. I encountered this problem with the RNA-Seq data sets from ENCODE. In theory, the two reads from one insert should sit at the same position in their respective files. This order matters for downstream alignment tools such as BWA, which assume that the i-th read in each file belongs to the same pair.
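Before fixing anything, it helps to confirm that the two files really are out of sync. The following is a minimal sketch, not part of the scripts mentioned in this post. It assumes plain four-line FASTQ records (no wrapped sequences) and that mates are distinguished only by a trailing /1 or /2 on the read name; the function names `read_names` and `first_mismatch` are made up for illustration.

```python
from itertools import zip_longest


def read_names(lines):
    """Yield the read name of each 4-line FASTQ record in an iterable of lines."""
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue  # only the first line of each record carries the name
        fields = line.split()
        if not fields:
            continue  # tolerate a blank line at the end of the file
        name = fields[0]
        if name.startswith("@"):
            name = name[1:]
        # Assumption: the mate suffix is a trailing /1 or /2.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(left_lines, right_lines):
    """Return the 0-based index of the first record whose names differ, or None."""
    pairs = zip_longest(read_names(left_lines), read_names(right_lines))
    for index, (left, right) in enumerate(pairs):
        if left != right:  # also triggers when one file is shorter
            return index
    return None
```

Passing two open file handles to `first_mismatch` returns `None` when the files agree record by record, and otherwise the index of the first record where they diverge.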
To fix the order, I have written some scripts that use the Picard tools. These scripts can be found on our lab website.
The basic idea is to convert each FASTQ file to a SAM file (with reads marked as unmapped) individually using the FastqToSam program, then merge both SAM files into one using MergeSamFiles and sort the merged file by read name using SortSam. Once the file is sorted, the reads can be extracted easily from the SAM file. For this purpose I had to write a Perl script (also available at the link above), since the merged SAM file lacks a proper header and conventional programs such as SamToFastq and BamToFastq won't work on it.
Before proceeding with the scripts, please go through the instructions.
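For small files, the same pool, sort by name, and split idea can be illustrated without Picard. The sketch below is not the Picard-based pipeline or the Perl script described above; it is a simplified in-memory stand-in. It assumes four-line FASTQ records, unique read names within each file, mates marked by a trailing /1 or /2, and enough memory to hold both files. Reads that appear in only one file are dropped, which is a choice of this sketch. All function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, record) pairs, where record is the list of 4 FASTQ lines."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not record and not line.strip():
            continue  # skip blank lines between or after records
        record.append(line)
        if len(record) == 4:
            name = record[0].split()[0][1:]  # drop the leading '@'
            # Assumption: the mate suffix is a trailing /1 or /2.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name, record
            record = []


def pair_by_name(left_lines, right_lines):
    """Return (left_records, right_records), both sorted by shared read name."""
    left = dict(parse_fastq(left_lines))
    right = dict(parse_fastq(right_lines))
    shared = sorted(left.keys() & right.keys())  # reads present in both files
    return [left[n] for n in shared], [right[n] for n in shared]


def write_fastq(records, handle):
    """Write records (lists of 4 lines) to an open text file handle."""
    for record in records:
        handle.write("\n".join(record) + "\n")
```

Calling `pair_by_name` on two open FASTQ handles and passing each returned list to `write_fastq` produces two files in which the i-th records are mates. For data sets the size of a typical RNA-Seq run, a disk-based sort such as the one Picard performs is the more realistic route.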