Sunday, April 27, 2014

How to filter out background noise from the transcriptome assembly? -- Part 2


Dear readers,
This post continues my previous blog post about filtering background (noisy) transcripts or fragments out of the set of transcripts assembled from RNA-Seq reads. The previous method was based on the estimated insert size of the paired-end library used for sequencing.
The method below describes how further noise can be removed from the transcriptome assembly by comparing the structure of the assembled transcripts with the structure of the transcripts of known genes (reference transcripts). Compare the reference-genome alignments of the assembled transcripts with the reference annotations of known genes (such as RefSeq, Ensembl, UCSC or GENCODE) and remove:

1.) Single-exon fragments that do not overlap any known (reference) exon and lie completely within introns. Such fragments typically result from sequencing unspliced precursor RNA (pre-mRNA) molecules.

2.) Single-exon transcripts mapping completely outside known gene boundaries. These short fragments generally result from the assembly of overlapping misaligned RNA-Seq reads.
The rationale behind steps 1 and 2 is that bona fide spliced transcripts should assemble into transcripts with multiple exons.
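As a rough illustration of steps 1 and 2, here is a minimal Python sketch that labels a single-exon transcript as exonic, intronic or intergenic relative to a reference annotation. The function names, the half-open coordinate convention and the toy data layout are my own assumptions for this example; a real pipeline would read the coordinates from BED or GTF files and handle chromosomes and strands separately.

```python
from typing import List, Tuple

# Half-open [start, end) interval on a single chromosome (assumed convention).
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def contained(inner: Interval, outer: Interval) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify_single_exon(tx: Interval,
                         genes: List[Interval],
                         exons: List[Interval]) -> str:
    """Label a single-exon transcript relative to the reference annotation."""
    if any(overlaps(tx, e) for e in exons):
        return "exonic"        # touches a known exon: keep
    if any(contained(tx, g) for g in genes):
        return "intronic"      # step 1: inside a gene, but no exon overlap
    if not any(overlaps(tx, g) for g in genes):
        return "intergenic"    # step 2: completely outside known genes
    return "other"             # straddles a gene boundary without exon overlap


def keep_transcript(n_exons: int, tx: Interval,
                    genes: List[Interval], exons: List[Interval]) -> bool:
    """Multi-exon transcripts always pass; single-exon ones must be exonic."""
    if n_exons > 1:
        return True
    return classify_single_exon(tx, genes, exons) == "exonic"
```

For a toy gene spanning 100-1000 with exons at 100-200, 400-500 and 900-1000, a single-exon fragment at 250-350 is labelled intronic and one at 1500-1600 intergenic, so both would be dropped.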
Deep RNA-Seq may also reveal novel exons that are not yet included in the reference annotation. In that case, the expression value (RPKM, FPKM or another tag-count measure) of the novel exon should be comparable to (in the same order of magnitude as) that of the neighboring exons, so expression values can be used as an additional filter (see Mortazavi et al. for more details).
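The "same order of magnitude" check can be sketched as follows. This is only one possible reading of the idea: comparing against the median of the neighboring exons and using a threshold of one log10 unit are my own assumptions, not a procedure taken from Mortazavi et al.

```python
import math
import statistics
from typing import Sequence


def comparable_expression(novel_fpkm: float,
                          neighbor_fpkms: Sequence[float],
                          max_log10_diff: float = 1.0) -> bool:
    """Return True if a candidate novel exon is expressed within
    `max_log10_diff` orders of magnitude of its neighboring exons.

    The neighbors are summarized by their median (an assumption made
    here to reduce the influence of a single outlier exon).
    """
    expressed = [v for v in neighbor_fpkms if v > 0]
    if novel_fpkm <= 0 or not expressed:
        return False  # nothing to compare against, or exon not expressed
    reference = statistics.median(expressed)
    return abs(math.log10(novel_fpkm) - math.log10(reference)) <= max_log10_diff
```

With neighboring exons at 10, 12 and 9 FPKM, a candidate exon at 8 FPKM passes, while one at 0.05 FPKM is more than two orders of magnitude lower and would be treated as background.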

3.) Transcripts mapping to genes that are known to have multiple copies in the genome, such as ribosomal genes and paralogous genes (including pseudogenes). These alignments typically arise from misaligned, and hence misassembled, sequencing reads.

4.) Transcripts mapping close to (or overlapping) unfinished regions (assembly gaps) or low-quality regions of the reference genome assembly. Coordinates of such regions for the human genome can be downloaded from the UCSC genome database. The UCSC genome database also provides uniqueness and mappability scores for the reference genome. Uniqueness and mappability scores are defined using publicly available high-throughput genome and RNA-Seq data. Genomic regions with low scores are typically considered inaccessible to sequencing and do not produce high-quality alignments. Hence, transcripts mapping to those regions can also be filtered out.

5.) Low-complexity and simple-repeat-rich genomic regions also produce low-quality and ambiguous alignments. Hence, transcripts overlapping such regions (as defined in the UCSC genome browser) should also be removed. Some RNA-Seq studies, such as Lee et al., are specifically designed to study the expression of repetitive elements (such as transposable elements). In that case, this filter should be applied with additional caution.
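Steps 4 and 5 both come down to discarding transcripts that overlap (or lie near) a set of problematic genomic intervals. Below is a minimal Python sketch of that overlap test. The tuple layout, the `pad` parameter for "close to", and the choice to test exons rather than the whole transcript span are assumptions made for illustration; in practice the regions would come from downloaded gap, mappability or repeat tracks, and a tool such as bedtools could do the same job.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions, pad=0):
    """Merge (chrom, start, end) regions, half-open, after padding each
    side by `pad` bases. Returns chrom -> (sorted starts, matching ends)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((max(0, start - pad), end + pad))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:          # overlapping or touching
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([m[0] for m in merged], [m[1] for m in merged])
    return index


def hits_region(index, chrom, start, end):
    """True if [start, end) overlaps any merged region on `chrom`."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins before the query ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_transcripts(transcripts, index):
    """Keep transcripts (name, chrom, exons) with no exon in a flagged region."""
    return [t for t in transcripts
            if not any(hits_region(index, t[1], s, e) for s, e in t[2])]
```

Because the regions are merged and sorted once, each lookup is a single binary search, which matters when the repeat track holds millions of intervals. For studies focused on repetitive elements, the repeat regions would simply be left out of the index.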

In the following posts in this series, I'll discuss some statistical methods for removing background noise based on the expression of the transcripts.