Tuesday, July 22, 2014

How to filter out background noise from the transcriptome assembly? -- Part 3


Dear readers,
Continuing with the new post in the blog series about filtering the low-confidence transcripts (or background noise) from the transcriptome assembly generated using RNA-Seq reads. 

This particular methods is a classification based method that uses recursive partitioning  and regression trees. Applicability of this method for RNA-Seq based transcriptome assembly was originally demonstrated by Prensner et al.. Most of the details are provided in the supplementary methods of the paper and rest were collected by corresponding with the authors of the paper.

Please note that this method is mainly applicable for the reference genome guided transcriptome assembly for RNA-Seq data-sets in a multi-sample experiment.

Working principle:
Known reference transcripts  (such as RefSeq, Ensembl, UCSC transcripts etc..) represent a comprehensive set of high-confidence transcripts that with wide range of expression levels across various tissue and cell types.  Assembled transcripts with structural and expression features resembling with the reference transcripts can be considered as 'genuine' and those that differ represent background noise.

This method is mainly applicable for multi-sample experiments. It is assumed that you have a set of assembled transcripts that have been unified across all the samples  (using programs such as Cuffdiff or Cuffmerge) and then expression value for each transcript is calculated across all the samples (read my previous post on unifying the transcriptome assemblies).

 
Step 1: Reference transcript comparison and classification
Compare the genomic coordinates of assembled transcripts with the known reference transcripts (that are used for the annotation). This comparison can be done by using Cuffdiff, R-SAP or any other similar program.

Step 2: Classification
Classify the assembled transcripts that overlap with the reference transcripts as 'annotated' or 'known' and remaining transcripts as 'un-annotated' or 'unknown'. This overlap van vary from single nucleotide overlap (most lenient) to a complete exon structure match (most stringent).

Step 3:
For each assembled transcripts, collect transcriptomic, genomic and expression features and create a table as shown in the figure below:













Here:
TxId: TCONS1, TCONS2, TCONS3.... are the assembled transcript Ids.
Tx len: Length of the transcript (in base pairs)
# of exons:  Number of exons in the transcript
# of samples: Number of samples the transcript was detected in
Tx expression: Expression (RPKM or FPKM) of the transcript. Since the transcript is detected across multiple samples, one can take the 95th percentile of expression or maximum expression value
Uniqueness:  Uniqueness score (as determined by UCSC genome browser) for the genomic region that harbor the assembled transcript
Outcome: 'annotated' (represented by 1) or 'un-annotated' (represented by 0) as determined in 'Step 2'.

Note: Additional genomic features such as conservation, mappability, GC percent can also be included in the above tables. These measures can be found using the UCSC genome browser tracks.


Step 4: Classification and filtering
In this step assembled transcripts are used as sort of training set that is used to classify the 'un-annotated' transcripts. Classification (or to say more precisely clustering) is done based on the similarity between transcript and genomic features (defined in the table from 'Step 3'). For this classification, a R package called 'rpart' (available here) or 'recursive partitioning and regression trees' is used. Table from 'Step 3' is imported in 'rpart' and classification tree is presented as the output.

In the tree, each node (traversing from root to leaves of the tree) is labeled with the classification criterion. These criteria serve as the cutoff thresholds for calling a transcript genuinely expressed (if clustered withe the 'annotated' transcripts) or background noise (if clustered with the 'un-annotated' transcripts). This classification (or clustering) in rpart is also accompanied by a probability score that can also be used as cutoff for defining a cluster as 'annotated' or 'unannotated' (see the rpart manual here).


Note: As the author suggested in the paper, above filtering should be performed independently for each chromosome in order to account for variability in the gene-density across chromosomes.