Monday, August 4, 2014

How to filter out background noise from the transcriptome assembly? -- Part 4


Hello readers,
In this post, I'll discuss yet another method to filter out background noise from RNA-Seq based transcriptome assemblies. This method was originally proposed by Cabili et al. (Genes Dev, 2011). Thanks to Moran Cabili (the main author of the paper) for answering questions about the method and providing details on the statistical analysis performed in the paper.


Working principle:
Mis-alignment of RNA-Seq reads and mis-assembly result in numerous incomplete transcripts (or fragments) that are generally expressed at very low levels compared to 'authentic', fully assembled transcripts and are considered 'background noise'. One can learn the minimum acceptable expression level (or coverage) from the expression of authentic transcripts and filter out all transcripts (the background noise) with expression levels below this threshold. The threshold is optimized by controlling the rates of true positives (TP) and false positives (FP) using a ROC (receiver operating characteristic) curve.

This method is mainly applicable to multi-sample experiments. It is assumed that you have a set of assembled transcripts that has been unified across all the samples (using programs such as Cuffdiff or Cuffmerge) and that the expression value for each transcript has been calculated across all the samples (read my previous post on unifying transcriptome assemblies).


Step 1: Reference transcript comparison
Compare the assembled transcripts with the known reference transcripts (that are used for the annotation). This comparison can be done by using Cuffdiff, R-SAP or any other similar program.

Step 2: Full-length and partial-length transcript classification
Based on the comparison, classify assembled transcripts as 'full-length' or 'partial-length'.
A full-length transcript matches a reference transcript and includes all of the reference transcript's exons; an exact match to the 5'- and 3'-boundaries may not be required. Transcripts that match a reference transcript only partially are classified as partial-length transcripts.
For now, exclude all assembled transcripts that do not match any reference transcript (i.e., exclude all the inter-genic transcripts).
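As an illustration of the classification rule above, here is a minimal Python sketch. It assumes exons have already been reduced to (start, end) intervals on the same chromosome and strand; in practice this comparison is done with a tool such as those named in Step 1.

```python
def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def classify(assembled_exons, reference_exons):
    """'full' if every reference exon is covered by some assembled exon,
    'partial' if at least one is, 'intergenic' if none are."""
    hits = [any(overlaps(r, a) for a in assembled_exons) for r in reference_exons]
    if all(hits):
        return "full"
    if any(hits):
        return "partial"
    return "intergenic"

ref = [(100, 200), (300, 400), (500, 600)]          # exons of a known transcript
print(classify([(90, 210), (290, 410), (480, 620)], ref))  # full
print(classify([(90, 210)], ref))                           # partial
print(classify([(1000, 1100)], ref))                        # intergenic
```

Note that the boundary coordinates of the assembled exons need not match the reference exactly, matching the relaxed 5'/3' rule above.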

Step 3: Coverage threshold point selection
For each full-length and partial-length transcript, determine its maximal expression (or coverage) across all the samples.
Collect these expression values in a list, sort the list and, starting from the lower end (lowest expression), choose a series of expression thresholds. Selection of the cutoff points is somewhat arbitrary, so you may have to experiment a bit. For example, if transcript expression values range from 1 to 72000, you can start with 2, 4, 6, 10, 12, 14, 16, etc. Choose enough threshold points to plot the ROC curve.

Step 4: Statistics calculation
At each expression threshold (say 't') selected in the above step, count the numbers of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) as follows:

TP: number of full-length transcripts with expression values higher than or equal to 't'
FN: total number of full-length transcripts - TP

FP: number of partial-length transcripts with expression values higher than or equal to 't'
TN: total number of partial-length transcripts - FP

Then calculate true positive rate (TPR) as: TP / (TP + FN)
and false positive rate (FPR) as: FP / (FP + TN)

TPR  = sensitivity (Sn) and
FPR = 1 - specificity (Sp)
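Steps 3 and 4 can be sketched in a few lines of Python. The coverage values below are invented for illustration; real inputs would be the per-transcript maximal coverages from Step 3.

```python
def roc_points(full_cov, partial_cov, thresholds):
    """Return a list of (t, TPR, FPR) rows, treating full-length
    transcripts as positives and partial-length ones as negatives."""
    rows = []
    for t in thresholds:
        tp = sum(c >= t for c in full_cov)      # full-length at/above t
        fp = sum(c >= t for c in partial_cov)   # partial-length at/above t
        fn = len(full_cov) - tp
        tn = len(partial_cov) - fp
        rows.append((t, tp / (tp + fn), fp / (fp + tn)))
    return rows

full_cov = [5.0, 9.2, 14.1, 30.5, 72.0]    # max coverage, full-length set
partial_cov = [0.4, 1.1, 2.3, 3.8, 16.0]   # max coverage, partial-length set
for row in roc_points(full_cov, partial_cov, [2, 4, 6, 10]):
    print(row)
```

Each printed row is one line of the (threshold, TPR, FPR) summary table described below.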

Cutoff-based sensitivity and specificity calculations are explained with examples at:
http://gim.unmc.edu/dxtests/ROC1.htm and http://gim.unmc.edu/dxtests/roc2.htm


Results from the above steps can be summarized in a table as follows:

Threshold   TPR     FPR
t1          tpr1    fpr1
t2          tpr2    fpr2
t3          tpr3    fpr3
...         ...     ...

Here, t1, t2, t3... are coverage thresholds, and tpr1, tpr2, tpr3... and fpr1, fpr2, fpr3... are the TPR and FPR at each coverage threshold.





Step 5: ROC curve plotting
Plot TPR (Y-axis) vs. FPR (X-axis) (as shown here).


Step 6: Selecting optimal coverage threshold
Using the ROC curve, the TPR and FPR corresponding to each cutoff can be determined. The objective of plotting the ROC curve is to find the cutoff (or threshold coverage) where the TPR is maximal and the FPR is minimal. Different ways to infer that 'optimal' threshold using the ROC curve are described on pages 6 and 7 here. Of those, two straightforward methods are:

a.) Choose the point on the curve that is closest to the top-left corner of the ROC plot (FPR = 0, TPR = 1), i.e., find the point on the curve that minimizes the distance

d = sqrt( (1 - Sn)^2 + (1 - Sp)^2 )

where Sn = sensitivity and Sp = specificity


 b.) Use the Youden index (J), which corresponds to the point on the ROC curve that is farthest from the line of equality (see on the ROC curve here; page 6). Basically, J picks the threshold that maximizes sensitivity plus specificity, i.e. J = max [Sn + Sp - 1]

Please note that the X-axis (FPR) of the ROC curve is 1 - Sp. Hence Sp = 1 - FPR (the third column value in the table from Step 4 above)
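Both selection rules can be applied directly to the (threshold, TPR, FPR) table from Step 4. Here is a minimal Python sketch with made-up ROC points:

```python
def optimal_cutoffs(points):
    """points: list of (threshold, TPR, FPR) rows.
    Returns the cutoff chosen by the closest-to-corner criterion
    and the cutoff chosen by Youden's J."""
    # (a) minimize sqrt((1 - Sn)^2 + (1 - Sp)^2); Sn = TPR, 1 - Sp = FPR
    closest = min(points, key=lambda p: ((1 - p[1]) ** 2 + p[2] ** 2) ** 0.5)
    # (b) maximize Youden's J = Sn + Sp - 1 = TPR - FPR
    youden = max(points, key=lambda p: p[1] - p[2])
    return closest[0], youden[0]

pts = [(2, 1.0, 0.60), (4, 1.0, 0.20), (6, 0.8, 0.20), (10, 0.6, 0.20)]
print(optimal_cutoffs(pts))  # (4, 4)
```

Here the two criteria happen to agree on the same cutoff; on real data they can differ, in which case some judgment is needed.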


Step 7: Background filtering

Filter out all assembled transcripts (including the inter-genic transcripts set aside in Step 2) whose maximal coverage across the samples is below the optimal threshold coverage inferred in the step above.


Note: As the authors suggested in the paper, protein-coding and non-protein-coding transcripts may differ in overall expression levels, so the calculation of the optimal threshold and the subsequent filtering can be done separately for protein-coding and non-protein-coding transcripts.








Tuesday, July 22, 2014

How to filter out background noise from the transcriptome assembly? -- Part 3


Dear readers,
Continuing with the new post in the blog series about filtering the low-confidence transcripts (or background noise) from the transcriptome assembly generated using RNA-Seq reads. 

This particular method is a classification-based method that uses recursive partitioning and regression trees. The applicability of this method to RNA-Seq based transcriptome assembly was originally demonstrated by Prensner et al. Most of the details are provided in the supplementary methods of the paper, and the rest were collected by corresponding with the authors of the paper.

Please note that this method is mainly applicable to reference genome guided transcriptome assembly of RNA-Seq data-sets in a multi-sample experiment.

Working principle:
Known reference transcripts (such as RefSeq, Ensembl, UCSC transcripts, etc.) represent a comprehensive set of high-confidence transcripts with a wide range of expression levels across various tissue and cell types. Assembled transcripts with structural and expression features resembling those of the reference transcripts can be considered 'genuine', and those that differ represent background noise.

This method is mainly applicable to multi-sample experiments. It is assumed that you have a set of assembled transcripts that has been unified across all the samples (using programs such as Cuffdiff or Cuffmerge) and that the expression value for each transcript has been calculated across all the samples (read my previous post on unifying transcriptome assemblies).

 
Step 1: Reference transcript comparison and classification
Compare the genomic coordinates of assembled transcripts with the known reference transcripts (that are used for the annotation). This comparison can be done by using Cuffdiff, R-SAP or any other similar program.

Step 2: Classification
Classify the assembled transcripts that overlap with reference transcripts as 'annotated' (or 'known') and the remaining transcripts as 'un-annotated' (or 'unknown'). The required overlap can vary from a single-nucleotide overlap (most lenient) to a complete exon-structure match (most stringent).

Step 3: Feature collection
For each assembled transcript, collect transcriptomic, genomic and expression features and create a table as shown below:

TxId     Tx len   # of exons   # of samples   Tx expression   Uniqueness   Outcome
TCONS1   ...      ...          ...            ...             ...          ...
TCONS2   ...      ...          ...            ...             ...          ...
TCONS3   ...      ...          ...            ...             ...          ...
...
Here:
TxId: TCONS1, TCONS2, TCONS3.... are the assembled transcript Ids.
Tx len: Length of the transcript (in base pairs)
# of exons:  Number of exons in the transcript
# of samples: Number of samples the transcript was detected in
Tx expression: Expression (RPKM or FPKM) of the transcript. Since the transcript is detected across multiple samples, one can take the 95th percentile of expression or maximum expression value
Uniqueness:  Uniqueness score (as determined by UCSC genome browser) for the genomic region that harbor the assembled transcript
Outcome: 'annotated' (represented by 1) or 'un-annotated' (represented by 0) as determined in 'Step 2'.

Note: Additional genomic features such as conservation, mappability and GC percent can also be included in the above table. These measures can be found using the UCSC genome browser tracks.


Step 4: Classification and filtering
In this step the assembled transcripts serve as a training set of sorts that is used to classify the 'un-annotated' transcripts. Classification (or, more precisely, clustering) is done based on the similarity between transcript and genomic features (defined in the table from 'Step 3'). For this classification, an R package called 'rpart' (available here), for 'recursive partitioning and regression trees', is used. The table from 'Step 3' is imported into 'rpart', and a classification tree is produced as the output.

In the tree, each node (traversing from the root to the leaves) is labeled with its classification criterion. These criteria serve as the cutoff thresholds for calling a transcript genuinely expressed (if clustered with the 'annotated' transcripts) or background noise (if clustered with the 'un-annotated' transcripts). The classification (or clustering) in rpart is also accompanied by a probability score that can likewise be used as a cutoff for labeling a cluster 'annotated' or 'un-annotated' (see the rpart manual here).
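The original analysis uses R's rpart, but the core step it applies recursively (greedily choosing the feature and threshold that best separate 'annotated' from 'un-annotated' rows) can be illustrated with a toy Python sketch. All feature values below are invented; rows follow the Step-3 table as (tx_len, # of exons, # of samples, expression), with label 1 = annotated, 0 = un-annotated.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return the (feature_index, threshold) pair whose split
    'feature <= threshold' minimizes the weighted Gini impurity."""
    best, best_score, n = None, float("inf"), len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (f, t)
    return best

rows = [(300, 1, 1, 0.5), (450, 1, 2, 0.8),        # short, single-exon, low expression
        (2100, 7, 9, 12.0), (3500, 12, 10, 30.0)]  # long, multi-exon, well expressed
labels = [0, 0, 1, 1]
print(best_split(rows, labels))  # (0, 450): split on transcript length at 450 bp
```

rpart repeats this kind of split on each resulting subset to grow the full tree; the chosen thresholds are the classification criteria at the tree nodes.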


Note: As the authors suggested in the paper, the above filtering should be performed independently for each chromosome in order to account for the variability in gene density across chromosomes.







Sunday, April 27, 2014

How to filter out background noise from the transcriptome assembly? -- Part 2


Dear readers,
This post is a continuation of my previous blog post about filtering background (or noisy) transcripts (or fragments) from the set of transcripts assembled using RNA-Seq reads. The previous method was based on the estimated insert size of the paired-end library used for the sequencing.
The following method describes how further noise can be removed from the transcriptome assembly using the structures of the assembled transcripts and of the transcripts of known genes (reference transcripts). Compare the reference genome alignments of the assembled transcripts with the reference annotations of known genes (such as RefSeq, Ensembl, UCSC, Gencode, etc.) and remove:

1.) Single-exon fragments that do not overlap with any known (or reference) exons and lie completely within introns. Such fragments are typically generated by sequencing of unspliced, premature RNA molecules.

2.) Single-exon transcripts mapping completely outside known gene boundaries. These short fragments generally result from the assembly of overlapping mis-aligned RNA-Seq reads.
The rationale behind steps 1 and 2 is that bona fide genes would generate transcripts with multiple exons.
Deep RNA-Seq may also reveal novel exons that are currently not included in the reference annotation. In that case, the expression value (RPKM/FPKM or another tag-count measure) of a novel exon should be comparable (within the same order of magnitude) to that of neighboring exons, and filtering can additionally use the expression values (see Mortazavi et al. for more details).

3.) Transcripts mapping to genes that are known to have multiple copies in the genome, such as ribosomal genes and paralogous genes (including pseudogenes). These alignments are typically generated from mis-aligned, and hence mis-assembled, sequencing reads.

4.) Transcripts mapping close to (or overlapping with) unfinished (assembly gaps) or low-quality parts of the reference genome assembly. Coordinates of such genomic regions for the human genome can be downloaded from the UCSC genome database. The UCSC genome database also provides uniqueness and mappability scores for the reference genome, defined using publicly available high-throughput genome and RNA-Seq data. Genomic regions with low scores are typically considered inaccessible to sequencing and do not produce high-quality alignments, so transcripts mapping to those regions can also be filtered out.

5.) Low-complexity and simple-repeat-rich genomic regions also produce low-quality and ambiguous alignments, so transcripts overlapping such regions (as defined in the UCSC genome browser) should also be removed. Some RNA-Seq studies (such as Lee et al.) are specifically designed to study the expression of repetitive elements (such as transposable elements); in that case, additional caution should be taken.
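Rules 1 and 2 above can be sketched as a simple interval check in Python. Exon coordinates are assumed to be (start, end) tuples on one chromosome, as would come from a GTF/BED comparison; the coordinates below are invented.

```python
def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def keep_transcript(tx_exons, ref_exons):
    """Apply rules 1 and 2: drop single-exon transcripts that touch no
    known exon (purely intronic or intergenic); keep everything else."""
    if len(tx_exons) > 1:
        return True              # multi-exon: rules 1 and 2 do not apply
    exon = tx_exons[0]
    return any(overlaps(exon, r) for r in ref_exons)

ref_exons = [(100, 200), (500, 600)]     # exons of a known two-exon gene
print(keep_transcript([(120, 180), (520, 580)], ref_exons))  # True (multi-exon)
print(keep_transcript([(250, 400)], ref_exons))              # False (intronic fragment)
print(keep_transcript([(900, 1000)], ref_exons))             # False (intergenic fragment)
```

Rules 3-5 would be implemented the same way, intersecting transcript coordinates with the relevant UCSC tracks (repeats, gaps, low-mappability regions).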

In my following posts in this series I'll discuss some of the statistical methods to remove background noise based on the expression of the transcripts.




Sunday, February 9, 2014

Functional genomics and gene-symbols

Post-genome/transcriptome analysis, especially in cancer or other disease-related studies, usually ends in functional genomics, which includes comparing the list of mutated or changed (qualitatively or quantitatively) genes with lists of genes from ontologies (GO, GeneGo), canonical pathways (DAVID, KEGG, Wikipathways), gene networks, or genes implicated in disease (COSMIC database).
Now, this gene-list comparison is typically done by matching gene-symbols (also known as gene identifiers). Gene-symbol comparison can become tricky at times, since gene-symbols change due to either a change in the underlying nomenclature or re-annotation of gene function. For example, ARHGEF7 can be referred to as KIAA0142, PIXB, DKFZp761K1021, Nbla10314, DKFZp686C12170, BETA-PIX, COOL1, P85SPR, P85, P85COOL1, P50BP, PAK3 or P50.

This problem sounds trivial, but using incorrect or inconsistent gene-symbols can result in data loss (changed genes being silently dropped), which in turn can confound accurate biological interpretation of the data.

The solution to this problem is to always use the current standardized gene-symbols approved and provided by the nomenclature committee, HGNC (HUGO Gene Nomenclature Committee).
A batch converter tool (http://www.genenames.org/cgi-bin/symbol_checker) converts older gene-symbols and gene-aliases to their current standard versions. Alternatively, gene-symbols can also be converted to a consistent set using the DAVID conversion tool (available at: http://david.abcc.ncifcrf.gov/conversion.jsp)

Also, using Excel (yes... Microsoft Excel) for handling gene-lists can change gene-symbols. The supposedly "self-sufficiently smart" Excel confuses some gene-symbols with calendar dates and converts them to dates permanently. For example, MARCH1 becomes 1-March and SEPT2 becomes 2-September. I have gathered a list (available here) of 34 human gene-symbols that are changed by Excel. This problem has also been described previously by Zeeberg et al.
Excel-mediated conversion of gene-symbols can be avoided by selecting the column(s) that will contain gene-symbols and then setting their format to 'text' by right-clicking on them.

While comparing gene-symbol lists, there are a few other things to keep in mind. For example, make sure all the gene-symbols are in the same case (C8orf4 vs. C8ORF4) and remove any spaces from within or around the gene-symbols.
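Here is a small Python sketch of this clean-up, which also flags values that look like Excel date mangling. The regex and the example symbols are illustrative only; the HGNC checker above remains the authoritative fix.

```python
import re

# Matches values like '1-Mar' or '2-September' (likely Excel-converted dates)
DATE_LIKE = re.compile(
    r"^\d{1,2}-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[A-Z]*$",
    re.IGNORECASE)

def clean_symbols(symbols):
    """Strip spaces, upper-case for comparison, and separate out values
    that look like Excel date conversions rather than gene-symbols."""
    cleaned, suspicious = [], []
    for s in symbols:
        s = s.strip().replace(" ", "").upper()
        if DATE_LIKE.match(s):
            suspicious.append(s)   # likely a mangled date, not a real symbol
        else:
            cleaned.append(s)
    return cleaned, suspicious

genes, bad = clean_symbols([" C8orf4 ", "TP53", "1-Mar", "2-September"])
print(genes)  # ['C8ORF4', 'TP53']
print(bad)    # ['1-MAR', '2-SEPTEMBER']
```

Flagged values should be checked against the original (pre-Excel) list and restored by hand, since the mangling is not reversible automatically.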



Sunday, January 26, 2014

How to filter out background noise from the transcriptome assembly? -- Part 1


Dear readers,
Apologies for not being active on the blog for a while. I have been planning to write on this topic for a long time so here it is:
High-throughput transcriptome studies are typically performed using short-read (Illumina) RNA-Seq. Apart from the expression estimation of genes/transcripts, RNA-Seq data is also used for discovering new RNAs such as lincRNAs and novel splice-variants of known genes. This usually requires transcriptome assembly using widely used pipelines such as Cufflinks or Scripture. Since these pipelines rely upon the alignment of short reads to a reference genome, there is a possibility of mis-alignments and hence mis-assemblies. A typical human transcriptome assembly (using Cufflinks) from 80-100 million short paired-end reads will generate 200k-250k assembled fragments (also known as assembled transcripts), and not all of them represent true RNA molecules from the cell. So it becomes crucial to exclude potential mis-assemblies before performing any downstream analysis.
Now, there is no definitive way to distinguish a real RNA assembly from a mis-assembly without performing PCR (or a similar experiment) for each of them, which is, of course, not feasible. But over the past few years, experts in the field have proposed and implemented several computational methods to remove potential mis-assemblies, also referred to as "background noise". I'll discuss them one by one in this blog post (and in following posts as well), focusing specifically on removing background noise from transcriptome assemblies (mainly reference genome guided assemblies).
Insert size based filtering:
Sequencing library preparation typically involves a "size-selection" step (following random fragmentation and cDNA amplification) that keeps RNA fragments of only a particular predefined size (known as the insert size). Typically the insert size for the Illumina paired-end protocol is 250-400 bp. The mean insert length and the associated distribution can also be estimated from the RNA-Seq data in question (see my previous blog post).
Ideally, if the RNA-Seq data has enough depth of coverage, there should not be any assembled fragment smaller than the insert size originally selected during the size-selection step. Short assembled fragments most likely represent assemblies caused by alignment errors or assemblies from shallowly sequenced genomic regions. Based on this, any assembled fragment that is shorter than the insert size can be discarded. To loosen this stringency, one can choose the cutoff to be the mean insert size minus one (or maybe two) standard deviations of the insert-size distribution.
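The cutoff rule can be sketched in a couple of lines of Python. The mean and standard deviation below are invented and would come from your own insert-size estimate:

```python
def length_cutoff(mean_insert, sd_insert, k=1):
    """Minimum fragment length: mean insert size minus k standard deviations."""
    return mean_insert - k * sd_insert

def filter_by_length(transcript_lengths, mean_insert, sd_insert, k=1):
    """Keep only assembled fragments at or above the length cutoff."""
    cutoff = length_cutoff(mean_insert, sd_insert, k)
    return [l for l in transcript_lengths if l >= cutoff]

lengths = [150, 240, 310, 980, 2400]        # assembled fragment lengths (bp)
print(length_cutoff(300, 50))               # 250
print(filter_by_length(lengths, 300, 50))   # [310, 980, 2400]
```

Setting k=2 relaxes the cutoff further (here to 200 bp), at the cost of letting more short fragments through.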
Here are references where the above cutoff has been used: Prensner et al. and Cabili et al. (see supplementary materials).
I'll discuss more on the filtering methods in my next post.. so enjoy till then.