Sunday, January 26, 2014

How to filter out background noise from the transcriptome assembly? -- Part 1


Dear readers,
Apologies for not being active on the blog for a while. I have been planning to write on this topic for a long time, so here it is:
High-throughput transcriptome studies are typically performed using short-read (Illumina) RNA-Seq. Apart from estimating the expression of genes/transcripts, RNA-Seq data is also used to discover new RNAs such as lincRNAs and novel splice variants of known genes. This usually requires transcriptome assembly with a widely used pipeline such as Cufflinks or Scripture. Since these pipelines rely on the alignment of short reads to the reference genome, there is a possibility of mis-alignments and hence mis-assemblies. A typical human transcriptome assembly (using Cufflinks) from 80-100 million short paired-end reads will generate 200k-250k assembled fragments (also known as assembled transcripts), and not all of them represent true RNA molecules from the cell. So it becomes crucial to exclude potential mis-assemblies before performing any downstream analysis.
Now, there is no definitive way to distinguish a real RNA assembly from a mis-assembly without performing PCR (or a similar experiment) for each of them, which is, of course, not feasible. But over the past few years, experts in the field have proposed and implemented several computational methods to remove potential mis-assemblies, which are also referred to as "background" noise. I'll discuss them one by one in this post (and in the following posts as well), focusing on removing background noise from transcriptome assemblies (mainly reference-genome-guided assemblies).
Insert size based filtering:
Sequencing library preparation typically involves a "size-selection" step (following the random fragmentation and cDNA amplification) that keeps fragments of only a particular predefined size (known as the insert size). Typically the insert size for the Illumina paired-end protocol is 250-400 bp. The mean insert length and the associated distribution can also be estimated from the RNA-Seq data in question (see my previous post).
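For readers who want to see what such an estimate might look like, here is a minimal Python sketch. It is not the method from my previous post; it simply assumes the alignments are available as SAM text and uses the TLEN column (field 9) of properly paired reads as a stand-in for the insert size. The function name `insert_size_stats` and the `max_tlen` cap are my own illustrative choices. Because RNA-Seq pairs can span introns on the genome, the sketch skips spliced reads (CIGAR containing `N`) and very large TLEN values, which is only a rough safeguard.

```python
import statistics

def insert_size_stats(sam_lines, max_tlen=1000):
    """Return (mean, standard deviation) of TLEN over properly paired reads.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    max_tlen:  hypothetical cap to drop pairs that most likely span an intron.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        cigar = fields[5]
        tlen = int(fields[8])
        if not flag & 0x2:                # keep properly paired reads only
            continue
        if "N" in cigar:                  # skip spliced alignments
            continue
        if 0 < tlen <= max_tlen:          # positive TLEN counts each pair once
            sizes.append(tlen)
    return statistics.mean(sizes), statistics.stdev(sizes)
```

With a SAM file on disk, one could call it as `insert_size_stats(open("aligned.sam"))`, where the file name is just a placeholder.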
Ideally, if the RNA-Seq data has enough depth of coverage, there should not be any assembled fragment shorter than the insert size chosen during the size-selection step. Short assembled fragments most likely result from alignment errors or from shallowly sequenced genomic regions. Based on this, any assembled fragment that is shorter than the insert size can be discarded. To relax this stringency, one can set the cutoff to the mean insert size minus one (or maybe two) standard deviations of the insert size distribution.
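The filter described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the assembly is a GTF file with one `exon` row per exon and a `transcript_id "..."` attribute (as in a Cufflinks `transcripts.gtf`), and the length of an assembled fragment is taken as the sum of its exon lengths. The function names and the example mean and standard deviation are made up for the sketch.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id from GTF text lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths[match.group(1)] += end - start + 1   # GTF coordinates are 1-based, inclusive
    return dict(lengths)

def length_cutoff(mean_insert, sd_insert, n_sd=1.0):
    """Cutoff = mean insert size minus n_sd standard deviations."""
    return mean_insert - n_sd * sd_insert

def keep_long_transcripts(lengths, cutoff):
    """Return the ids of transcripts at least as long as the cutoff."""
    return {tid for tid, length in lengths.items() if length >= cutoff}

# Example with made-up numbers: mean insert size 300 bp, SD 30 bp.
example_gtf = [
    'chr1\tCufflinks\texon\t100\t299\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tCufflinks\texon\t400\t599\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tCufflinks\texon\t1000\t1149\t.\t+\t.\tgene_id "G2"; transcript_id "T2";',
]
kept = keep_long_transcripts(transcript_lengths(example_gtf),
                             length_cutoff(300, 30, n_sd=1))
print(sorted(kept))
```

Here T1 (400 bp across two exons) passes the 270 bp cutoff, while the 150 bp single-exon T2 is dropped. Setting `n_sd=2` lowers the cutoff to 240 bp, which is the looser variant mentioned above.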
The references where the above cutoff has been used are Prensner et al. and Cabili et al. (see their supplementary material).
I'll discuss more filtering methods in my next post, so enjoy till then.