Wednesday, October 2, 2013

FPKM/RPKM normalization caveat and upper quartile normalization


FPKM (fragments per kilobase of exon per million fragments mapped) and RPKM (reads per kilobase of exon per million mapped reads) have been among the most widely used normalization methods for estimating transcript abundance from RNA-Seq tag counts (numbers of reads). The tag counts are normalized by the transcript length (exons only) as well as by the total number of mappable reads in the sequencing library. However, recent studies have pointed out that this normalization suffers from an inherent bias when a sample contains very highly expressed transcripts.

For example, suppose there are two samples in the study, S1 and S2, and S1 has several highly abundant (highly expressed) transcripts. By the nature of the RNA-Seq protocol, highly expressed transcripts generate a large number of reads, and as a result S1 will have a larger total number of reads than S2.

Now suppose there is a transcript, let's say T, that is expressed at identical levels in both S1 and S2. If we compare the FPKM or RPKM of T between S1 and S2, T will appear to be down-regulated in S1, because S1 has a higher total number of reads and normalizing by that total decreases the expression value of T by a large factor.

So normalization by the total number of reads can sometimes lead to false detection of differentially expressed genes or transcripts in quantitative comparison studies based on RNA-Seq.
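
To make the effect concrete, here is a small Python sketch with made-up numbers. The transcript length, read counts, and library sizes are illustrative assumptions, not data from any study, and the helper name rpkm is mine. T yields the same number of reads in both samples, but the S1 library is inflated by a few highly expressed transcripts:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_mapped


# Hypothetical transcript T: 2,000 bp long, 200 reads in BOTH samples.
t_length = 2_000
t_reads = 200

# Hypothetical library sizes: S1 is twice as large as S2 only because
# a few very highly expressed transcripts contribute extra reads.
total_s1 = 20_000_000
total_s2 = 10_000_000

rpkm_s1 = rpkm(t_reads, t_length, total_s1)
rpkm_s2 = rpkm(t_reads, t_length, total_s2)

print(rpkm_s1, rpkm_s2)  # 5.0 10.0
```

Although T produced identical read counts, its RPKM in S1 is half of its RPKM in S2, so T looks two-fold down-regulated in S1.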

One of the easiest solutions for removing this bias is upper quartile normalization, which can be performed in a few simple steps:

1.) Create a matrix (in an Excel sheet or a text file) where rows represent genes or transcripts and columns represent samples. Each cell contains the expression (read count or any other expression metric) of one gene or transcript in one sample from the study.
2.) Remove the genes/transcripts that have an expression value of 0 in all samples.

3.) Now for each column (i.e. for each sample):
     a. Sort the expression values in increasing order and find the 75th percentile value (upper quartile).
     b. Divide all the expression values in that column by the upper quartile value.

4.) This step is optional:
      Dividing the original expression values by a normalization factor results in very small values that may be awkward to read or to use in downstream calculations. To scale the normalized expression values back up, multiply all the expression values in the matrix by the mean of the upper quartiles of all the samples.
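
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name upper_quartile_normalize, the toy counts, and the percentile convention (statistics.quantiles with method="inclusive", i.e. linear interpolation between sorted values) are my assumptions, and other tools may compute the 75th percentile slightly differently.

```python
from statistics import mean, quantiles


def upper_quartile_normalize(counts, rescale=True):
    """Upper quartile normalization of a gene-by-sample matrix.

    counts maps a gene/transcript name to a list of expression values,
    one value per sample (all lists must have the same length).
    Returns (normalized matrix, per-sample upper quartiles).
    """
    # Step 2: drop genes/transcripts that are 0 in every sample.
    kept = {g: v for g, v in counts.items() if any(x != 0 for x in v)}
    if not kept:
        raise ValueError("no gene has a non-zero value")
    n_samples = len(next(iter(kept.values())))

    # Step 3a: 75th percentile of each column (sample).
    upper_quartiles = []
    for i in range(n_samples):
        column = [v[i] for v in kept.values()]
        upper_quartiles.append(quantiles(column, n=4, method="inclusive")[2])
    if any(d == 0 for d in upper_quartiles):
        raise ValueError("a sample has an upper quartile of 0")

    # Step 4 (optional): rescale by the mean of the upper quartiles.
    scale = mean(upper_quartiles) if rescale else 1.0

    # Step 3b: divide each value by its sample's upper quartile.
    normalized = {
        g: [x / d * scale for x, d in zip(v, upper_quartiles)]
        for g, v in kept.items()
    }
    return normalized, upper_quartiles


# Toy matrix (hypothetical counts): columns are samples S1 and S2.
counts = {
    "A": [10, 5],
    "B": [20, 10],
    "C": [30, 15],
    "D": [40, 20],
    "E": [1000, 25],  # very highly expressed in S1 only
    "F": [0, 0],      # removed in step 2
}

normalized, uq = upper_quartile_normalize(counts)
print(uq)               # [40.0, 20.0]
print(normalized["A"])  # [7.5, 7.5]
print(normalized["E"])  # [750.0, 37.5]
```

In this toy matrix S1 is sequenced at twice the depth of S2 and also contains one very highly expressed gene (E). Genes A to D come out equal across the two samples after normalization, because the upper quartile is not pulled up by E the way the total read count would be.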

I have summarized the above steps in the formula below:

$$N^{norm}_{ij} = \frac{N_{ij}}{D_i} \times \frac{1}{n} \sum_{k=1}^{n} D_k$$

Here:
j: the jth gene or transcript (rows in the matrix).
i: the ith sample, from 1 to n, where n is the total number of samples in the study (columns in the matrix).
Nnormij: normalized and scaled-up expression value of the jth gene or transcript in the ith sample.
Nij: original expression (raw read count) of the jth gene or transcript in the ith sample.
Di: upper quartile expression value for the ith sample.


Various normalization methods for RNA-Seq data are discussed in detail in:
http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long

If you are using Cuffdiff to estimate FPKM/RPKM, you can also perform upper quartile normalization internally by using the argument "--upper-quartile-norm".

3 comments:

  1. Do you have any idea which normalization methodology is used for the TCGA RNASeqV2 data? One of the blogs states that it is just upper quartile normalization on raw RSEM counts. Can you please clarify?

  2. Hi,
    Not sure what they use these days. I suggest looking into their README (or description file) for the RNA-Seq data. They have a pretty good explanation of the methods. You usually get this file when you download RNA-Seq data from TCGA. Hope this helps.
    Thanks.

