Hello readers,
For the past couple of months I have been working on de novo assemblies of human transcriptome (RNA-Seq) datasets. The goal is to detect chimeric transcripts, which requires aligning the assembled contigs against the reference genome. This alignment is usually done with BLAT and, for a typical mammalian transcriptome assembly, may take up to several weeks to finish.
Long assembled contigs, BLAT's sluggish performance, and its lack of parallelism can create a computational bottleneck when one is dealing with multiple samples.
One simple way to ease this problem is to reduce the number of contigs to align, passing only the potential chimeric transcripts to BLAT for alignment against the reference genome.
This is easily done by first aligning the assembled contigs to reference genome annotations, such as the Gencode or Ensembl reference transcript sequences, using MegaBlast. Since MegaBlast can run on multiple threads and the alignment is against the whole transcriptome rather than the whole genome, this step should not take more than 5-6 hours on a typical server.
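As a rough illustration of this step, here is a minimal Python sketch that assembles a multi-threaded MegaBlast command line. It assumes the BLAST+ `blastn` program with a nucleotide database already built from the reference transcript sequences. The file names, the database name, the thread count, and the choice of output columns are all placeholders picked for this example, not settings from my own runs.

```python
import shlex


def megablast_command(contigs_fasta, transcript_db, out_path, threads=8):
    """Build a blastn (MegaBlast) command as an argument list.

    The tabular columns requested here (query id, subject id, query length,
    query start, query end) are an assumption of this sketch; they are the
    minimum needed to judge how much of each contig a hit covers.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-query", contigs_fasta,
        "-db", transcript_db,
        "-num_threads", str(threads),
        "-outfmt", "6 qseqid sseqid qlen qstart qend",
        "-out", out_path,
    ]


# Hypothetical file and database names, for illustration only.
cmd = megablast_command("contigs.fa", "gencode_transcripts", "contigs_vs_tx.tsv")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with the multi-word `-outfmt` value.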
In the Blast output file, query sequences (the assembled contigs in this case) that represent potential chimeric transcripts will not produce a single alignment hit with good query coverage. I usually select as potential chimeric transcripts those contigs that have <90% of their length covered by the alignment and that also leave at least 25 bases unaligned on either or both ends (alignmentStart > 25 and/or queryLength - alignmentEnd > 25).
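The selection rule can be sketched in Python as follows. This is only one possible reading of the rule: a contig is kept as a candidate when none of its hits explains it well, that is, every hit covers less than 90% of the contig and leaves an unaligned stretch on at least one end. The sketch assumes tab-separated Blast output with the columns qseqid, sseqid, qlen, qstart, qend in that order (an assumption of this example), and the function name is made up for illustration. Contigs with no hit at all do not appear in the Blast output and are therefore not considered here.

```python
import csv
import io


def candidate_chimeric_contigs(handle, min_coverage=0.90, end_threshold=25):
    """Return contig ids for which no single hit gives good query coverage.

    Expects tab-separated rows: qseqid, sseqid, qlen, qstart, qend
    (column layout is an assumption of this sketch).
    """
    explained = {}  # contig id -> True once any single hit covers it well
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        contig = row[0]
        qlen, qstart, qend = int(row[2]), int(row[3]), int(row[4])
        start, end = min(qstart, qend), max(qstart, qend)
        coverage = (end - start + 1) / qlen
        # Same test as in the text: alignmentStart > 25 and/or
        # queryLength - alignmentEnd > 25.
        unaligned_end = start > end_threshold or (qlen - end) > end_threshold
        partial_hit = coverage < min_coverage and unaligned_end
        explained[contig] = explained.get(contig, False) or not partial_hit
    return [contig for contig, ok in explained.items() if not ok]


# Made-up example rows, for illustration only.
example = io.StringIO(
    "c1\tt1\t1000\t1\t990\n"
    "c2\tt2\t1000\t1\t500\n"
    "c2\tt3\t1000\t480\t1000\n"
    "c3\tt4\t1000\t1\t400\n"
    "c3\tt5\t1000\t5\t995\n"
)
print(candidate_chimeric_contigs(example))
```

In the made-up rows, c2 is split across two transcripts and is kept, while c1 and c3 each have one hit spanning nearly the whole contig and are dropped.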
This pre-genomic-alignment filtering reduces the query data file to roughly 20% of the original, and the alignment time with it.
The alignment may be sped up further if simple and low-complexity bases are trimmed from the ends of the assembled contigs. This can be done easily using Dustmasker or WindowMasker. The trimming may also help reduce false positives in chimeric transcript detection.
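One way to do the end trimming, sketched in Python: if the masking tool is asked for soft-masked FASTA output, where masked bases are written in lowercase (I believe Dustmasker's `-outfmt fasta` does this, but check it for your version), the masked bases at the contig ends can simply be stripped. The function name is made up for this example, and it works on a single sequence string rather than a whole FASTA file.

```python
def trim_masked_ends(seq):
    """Strip lowercase (soft-masked) bases from both ends of a sequence.

    Masked bases inside the sequence are left untouched; only the
    leading and trailing masked runs are removed.
    """
    start, end = 0, len(seq)
    while start < end and seq[start].islower():
        start += 1
    while end > start and seq[end - 1].islower():
        end -= 1
    return seq[start:end]


# Made-up sequence, for illustration only.
print(trim_masked_ends("aaaaaaACGTTGCAacacacGGCATCGAtttttttt"))
```

A contig that is masked along its whole length comes back as an empty string, so such contigs should be dropped before the alignment.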