Monday, August 4, 2014

How to filter out background noise from the transcriptome assembly? -- Part 4


Hello readers,
In this post, I'll discuss yet another method to filter out background noise from the RNA-Seq based transcriptome assemblies. This method was originally proposed by Cabili et al. Genes Dev 2011. Thanks to Moran Cabili (main author of the paper) for answering the questions related to the methods and providing details on the statistical analysis performed in the paper.


Working principle:
Mis-alignment of RNA-Seq reads and mis-assembly results in numerous incomplete transcripts (or fragments) that are generally expressed at very low level compared to the 'authentic' or fully assembled transcripts and considered as 'background noise'. One can learn the minimum optimal expression level (or coverage) from the expression of authentic transcripts and filter out all the transcripts (or background noise) with expression levels below this threshold. This threshold is optimized by controlling the rate of 'true positive (TP)' and 'false positive (FP)' using ROC (receiver operating characteristic) curve approach.

This method is mainly applicable for multi-sample experiments. It is assumed that you have a set of assembled transcripts that have been unified across all the samples  (using programs such as Cuffdiff or Cuffmerge) and then expression value for each transcript is calculated across all the samples (read my previous post on unifying the transcriptome assemblies).


Step 1: Reference transcript comparison
Compare the assembled transcripts with the known reference transcripts (that are used for the annotation). This comparison can be done by using Cuffdiff, R-SAP or any other similar program.

Step 2: Full-length and partial-length transcript classification
Based on the comparison, classify assembled transcripts as 'full-length'  and 'partial-length'. 
A full-length transcripts matches a reference transcripts and includes all of the reference transcript's exons. Exact match to the 5'- and 3'- boundary may not be required.  Remaining transcripts that do match partially with the reference transcripts are classified as  partial-length transcripts.
For now, exclude all the assembled transcripts that do not match any of the reference transcripts (i.e. exclude all the inter-genic transcripts).

Step 3: Coverage threshold point selection
For each full-length and partial-length transcript length transcript, determine it's maximal expression (or coverage) across all the samples.
Get all those expressions in a list, sort the list and from the lower end (lowest expression), choose some expression threshold. Selection of the cutoff points is arbitrary so you may have to contemplate a bit.  For example, if transcript expression values range from 1 - 72000, you can start from 2,4,6,10,12,14, 16.......etc. Choose enough number of threshold points to plot the ROC.

Step 4: Statistics calculation
At each expression threshold (let's say 't') selected in the above step, count number of true-positives (TP), false-positive (FP), false-negative (FN) and true-negative (FN) as follows:

TP: Number of full-length transcripts that have expression values higher or equal to 't'
FN: Total number of full length transcripts - TP

FP: Number of partial length transcripts that have expression values higher or equal to 't'
TN: Total number of partial length transcripts - FP

Then calculate true positive rate (TPR) as: TP / (TP + FN)
and false positive rate (FPR) as: FP / (FP + TN)

TPR  = sensitivity (Sn) and
FPR = 1 - specificity (Sp)

Examples on cutoff based sensitivity and specificity calculations are explained with examples at:
http://gim.unmc.edu/dxtests/ROC1.htm  and http://gim.unmc.edu/dxtests/roc2.htm


Results from the above steps can be summarized in a table as follows:


Here, t1,t2,t3.. are coverage thresholds, tpr1,tpr2,tpr3.... and fpr1,fpr2, fpr3... are TPR and FPR at each coverage threshold.





Step 5: ROC curve plotting
Plot TPR (Y-axis) vs. FPR (X-axis) (as shown here).


Step 6: Selecting optimal coverage threshold
Using the ROC curve, for each cutoff, corresponding TPR and FPR can be determined. The objective of plotting the ROC curve is to find the cutoff (or threshold coverage) where TPR is maximum and FPR is minimal. Different ways to infer that 'optimal' threshold using ROC curve are described on page 6 and 7 here). Of those, two straightforward methods are:

a.) Choose the point on the curve which is closest to the coordinate (1,0) i.e. calculate point on the curve for which
  
Where Sn = sensitivity and Sp = specificity


 b.) Use Youden index (J) that is point on ROC curve which is farthest from the line of equality (see on the ROC curve here ; page 6). Basically J will the optimal threshold that maximizes sensitivity and specificity i.e. J = max [Sn + Sp]

Please not that X-axis (FPR) on the ROC curve is 1 - Sp. Hence Sp  =  1 - FPR (third column value from table in the step 4 above)


Step 7: Background filtering

Filter out all the assembled transcripts (including the inter-genic transcripts set aside in step 2) for which the maximal coverage across the samples is below the optimal threshold coverage inferred in the step above.


Note: As author suggested in the paper, protein-coding and non-protein coding transcripts may differ  in overall expression levels. So the calculation of the optimal threshold and following filtering can be done separately for protein-coding and non-protein coding transcripts.