Friday, September 18, 2015

Computational biologist's lab notebook

The Schnell lab at the University of Michigan recently published an interesting editorial in PLOS Computational Biology (article link). It covers various aspects of a typical bioinformatics project, especially in a life-science research environment. The article is written to read like a wet-lab protocol and rule book. My personal favorites are rules #4 and #5, about not only recording all scientific activity (e.g., which bioinformatics programs / code were used, data sources, data references, parameters, etc.) but also keeping track of it in a date-wise fashion. This is essentially version control in a code-development setting. Give it a read... interesting article.


Ten Simple Rules for a Computational Biologist’s Laboratory Notebook (Santiago Schnell)


Monday, August 24, 2015

Another RNA-Seq tool from Lior Pachter's lab: Sleuth



A new RNA-Seq expression estimation program, 'sleuth', was released today by Lior Pachter's group. It challenges the currently used 'count-based' methods for RNA-Seq data, and it accounts for both biological and technical variation.
He raises a few good points: RNA-Seq based expression estimation is challenging because transcription varies biologically within and between cells, and technical variation between repeated sequencing experiments can sometimes overshadow the 'true' expression-level changes.

See more details here:
https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/

software download: http://pachterlab.github.io/sleuth/






Wednesday, July 22, 2015

RNA-Seq expression metrics: FPKM, RPKM and TPM explained by StatQuest


FPKM, RPKM and TPM are among the most widely used transcript and gene expression normalization methods in RNA-Seq studies. FPKM and RPKM are essentially the same metric, except that FPKM counts fragments (paired-end sequencing) while RPKM counts reads (single-end data). TPM is a bit different, as it is calculated using a different order of operations. StatQuest explains it well in a nutshell (please follow the video below and read the following link):



For text summary please refer to: StatQuest page
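To make the different order of operations concrete, here is a minimal Python sketch on a made-up, single-sample count table (the counts, lengths and variable names are purely illustrative):

# Toy example: raw read counts and transcript lengths (in kilobases) for one sample
counts = {"txA": 100, "txB": 500, "txC": 400}
lengths_kb = {"txA": 1.0, "txB": 2.5, "txC": 4.0}

# RPKM/FPKM: first scale by total reads (per million), then divide by length (per kb)
per_million = sum(counts.values()) / 1e6
rpkm = {tx: (counts[tx] / per_million) / lengths_kb[tx] for tx in counts}

# TPM: first divide by length (reads per kb), then scale so the values sum to one million
rpk = {tx: counts[tx] / lengths_kb[tx] for tx in counts}
per_million_rpk = sum(rpk.values()) / 1e6
tpm = {tx: rpk[tx] / per_million_rpk for tx in counts}

print(rpkm)
print(tpm)   # TPM values always add up to 1,000,000 within a sample

Because TPM values always sum to one million within a sample, they can be read directly as proportions of the transcript pool, which makes between-sample comparisons more intuitive.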


New ovarian cancer study on large structural variants based on TCGA data is out

We recently published a study on ovarian cancer using whole-genome sequence (control + tumor), RNA-Seq and microarray gene expression data from the same patients. The data were acquired from TCGA. The paper is titled:

 "Integrated sequence and expression analysis of ovarian cancer structural variants underscores the importance of gene fusion regulation" (BMC medical genomicsPubmed)

Here is a brief summary:
  • First, we designed an integrated bioinformatics workflow (Figure S1) to process and analyze the large volumes of genomic and transcriptomic data, and to detect structural variant (SV) breakpoints accurately at single-nucleotide resolution.
  • Using whole-genome sequencing (WGS) data, we first detected various classes of SVs and further classified them as 'germline derived' or 'somatically derived'.
  • Then, based on their structure and the underlying genomic regions, we determined their potential to create functional gene fusions at the RNA level.
  • Using RNA-Seq and microarray data, we measured the proportion of potential gene-fusion-forming SVs that actually get transcribed.
  • The observations suggest the existence of regulatory mechanism(s) that suppress the expression of the more established germline SVs (which could be segregating in the natural population) but facilitate selected somatically derived SVs at the RNA level in ovarian tumors.
  • Our findings resonate with the observations of Bueno et al. (Pubmed) that the expression of the BCR-ABL gene fusion, a very well-known tumor driver in non-solid tumors, can be regulated by the genetic and epigenetic silencing of miR-203.
  • Our findings are also relevant to the fact that recent studies have found several gene fusions, considered cancer biomarkers, in healthy individuals. Simply put, it is not the occurrence of SVs at the genomic level but the regulation of SV expression at the RNA level that contributes to their biological and clinical significance in the onset and progression of cancer.
Feel free to contact me if you have any questions or need more details about the paper.

Monday, August 4, 2014

How to filter out background noise from the transcriptome assembly? -- Part 4


Hello readers,
In this post, I'll discuss yet another method to filter out background noise from RNA-Seq based transcriptome assemblies. This method was originally proposed by Cabili et al. (Genes Dev 2011). Thanks to Moran Cabili (the main author of the paper) for answering my questions about the methods and providing details on the statistical analysis performed in the paper.


Working principle:
Mis-alignment of RNA-Seq reads and mis-assembly result in numerous incomplete transcripts (or fragments) that are generally expressed at very low levels compared to the 'authentic', fully assembled transcripts, and are therefore treated as 'background noise'. One can learn a minimum optimal expression level (or coverage) from the expression of the authentic transcripts and filter out all transcripts (the background noise) with expression below this threshold. The threshold is optimized by balancing the 'true positive (TP)' and 'false positive (FP)' rates using a ROC (receiver operating characteristic) curve approach.

This method is mainly applicable to multi-sample experiments. It is assumed that you have a set of assembled transcripts that have been unified across all the samples (using programs such as Cuffcompare or Cuffmerge) and that the expression value for each transcript has then been calculated across all the samples (read my previous post on unifying the transcriptome assemblies).


Step 1: Reference transcript comparison
Compare the assembled transcripts with the known reference transcripts (the ones used for the annotation). This comparison can be done using Cuffcompare, R-SAP or any other similar program.

Step 2: Full-length and partial-length transcript classification
Based on the comparison, classify assembled transcripts as 'full-length'  and 'partial-length'. 
A full-length transcript matches a reference transcript and includes all of the reference transcript's exons; an exact match to the 5'- and 3'- boundaries is not required. The remaining transcripts, which match the reference transcripts only partially, are classified as partial-length transcripts.
For now, exclude all the assembled transcripts that do not match any of the reference transcripts (i.e. exclude all the inter-genic transcripts).

Step 3: Coverage threshold point selection
For each full-length and partial-length transcript, determine its maximal expression (or coverage) across all the samples.
Collect all those expression values in a list, sort the list, and, starting from the lower end (lowest expression), choose a series of expression thresholds. The selection of cutoff points is somewhat arbitrary, so you may have to experiment a bit. For example, if transcript expression values range from 1 to 72000, you can start from 2, 4, 6, 10, 12, 14, 16, etc. Choose enough threshold points to plot the ROC curve.

Step 4: Statistics calculation
At each expression threshold (let's say 't') selected in the above step, count the number of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) as follows:

TP: Number of full-length transcripts that have expression values higher or equal to 't'
FN: Total number of full length transcripts - TP

FP: Number of partial length transcripts that have expression values higher or equal to 't'
TN: Total number of partial length transcripts - FP

Then calculate true positive rate (TPR) as: TP / (TP + FN)
and false positive rate (FPR) as: FP / (FP + TN)

TPR  = sensitivity (Sn) and
FPR = 1 - specificity (Sp)

Cutoff-based sensitivity and specificity calculations are explained with examples at:
http://gim.unmc.edu/dxtests/ROC1.htm  and http://gim.unmc.edu/dxtests/roc2.htm


Results from the above steps can be summarized in a table as follows:

Coverage threshold     TPR       FPR
t1                     tpr1      fpr1
t2                     tpr2      fpr2
t3                     tpr3      fpr3
...                    ...       ...

Here, t1, t2, t3... are the coverage thresholds, and tpr1, tpr2, tpr3... and fpr1, fpr2, fpr3... are the TPR and FPR at each coverage threshold.
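As a minimal illustration, the table above can be computed with a short Python script like the following (the expression values and thresholds are made up; only the counting logic matters):

# maximal per-transcript expression across samples for the two classes (made-up values)
full_length = [5.2, 8.1, 0.4, 12.0, 3.3, 7.7]      # reference-matching, 'true' set
partial_length = [0.1, 0.6, 2.0, 0.3, 4.5, 0.2]    # fragmented, 'noise' set

thresholds = [0.5, 1, 2, 4, 8]                      # candidate coverage cutoffs

rows = []
for t in thresholds:
    tp = sum(1 for x in full_length if x >= t)      # full-length kept at this cutoff
    fn = len(full_length) - tp
    fp = sum(1 for x in partial_length if x >= t)   # partial-length kept at this cutoff
    tn = len(partial_length) - fp
    tpr = tp / (tp + fn)                            # sensitivity
    fpr = fp / (fp + tn)                            # 1 - specificity
    rows.append((t, tpr, fpr))
    print(t, round(tpr, 3), round(fpr, 3))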





Step 5: ROC curve plotting
Plot TPR (Y-axis) vs. FPR (X-axis) (as shown here).


Step 6: Selecting optimal coverage threshold
Using the ROC curve, the TPR and FPR corresponding to each cutoff can be determined. The objective of plotting the ROC curve is to find the cutoff (or threshold coverage) where the TPR is maximal and the FPR is minimal. Different ways to infer this 'optimal' threshold from the ROC curve are described on pages 6 and 7 (here). Of those, two straightforward methods are:

a.) Choose the point on the curve that is closest to the ideal top-left corner (FPR = 0, TPR = 1), i.e. find the point on the curve that minimizes

d = sqrt( (1 - Sn)^2 + (1 - Sp)^2 )

where Sn = sensitivity and Sp = specificity.


 b.) Use the Youden index (J), which corresponds to the point on the ROC curve that is farthest from the line of equality (see the ROC curve here; page 6). The optimal threshold is the one that maximizes J = Sn + Sp - 1, i.e. the one that maximizes sensitivity plus specificity.

Please note that the X-axis (FPR) of the ROC curve is 1 - Sp; hence Sp = 1 - FPR (the third-column value in the table from Step 4 above).
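Continuing the sketch from Step 4, both selection criteria can be computed directly from the (threshold, TPR, FPR) rows; again, this is only an illustration:

# rows = [(threshold, TPR, FPR), ...] as computed in the Step 4 sketch
# a.) point closest to the ideal corner (FPR = 0, TPR = 1)
best_distance = min(rows, key=lambda r: ((1 - r[1]) ** 2 + r[2] ** 2) ** 0.5)

# b.) Youden index: maximize J = Sn + Sp - 1, i.e. TPR - FPR
best_youden = max(rows, key=lambda r: r[1] - r[2])

print("closest-to-corner threshold:", best_distance[0])
print("Youden threshold:", best_youden[0])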


Step 7: Background filtering

Filter out all the assembled transcripts (including the inter-genic transcripts set aside in step 2) for which the maximal coverage across the samples is below the optimal threshold coverage inferred in the step above.


Note: As the authors suggest in the paper, protein-coding and non-protein-coding transcripts may differ in overall expression levels, so the optimal threshold calculation and subsequent filtering can be done separately for protein-coding and non-protein-coding transcripts.








Tuesday, July 22, 2014

How to filter out background noise from the transcriptome assembly? -- Part 3


Dear readers,
This is the next post in the blog series about filtering low-confidence transcripts (or background noise) from transcriptome assemblies generated using RNA-Seq reads.

This particular method is a classification-based approach that uses recursive partitioning and regression trees. Its applicability to RNA-Seq based transcriptome assembly was originally demonstrated by Prensner et al. Most of the details are provided in the supplementary methods of the paper, and the rest were collected by corresponding with the authors.

Please note that this method is mainly applicable to reference-genome-guided transcriptome assemblies from RNA-Seq datasets in a multi-sample experiment.

Working principle:
Known reference transcripts (such as RefSeq, Ensembl, UCSC transcripts, etc.) represent a comprehensive set of high-confidence transcripts with a wide range of expression levels across various tissue and cell types. Assembled transcripts whose structural and expression features resemble those of the reference transcripts can be considered 'genuine', while those that differ represent background noise.

This method is mainly applicable to multi-sample experiments. It is assumed that you have a set of assembled transcripts that have been unified across all the samples (using programs such as Cuffcompare or Cuffmerge) and that the expression value for each transcript has then been calculated across all the samples (read my previous post on unifying the transcriptome assemblies).

 
Step 1: Reference transcript comparison and classification
Compare the genomic coordinates of the assembled transcripts with the known reference transcripts (the ones used for the annotation). This comparison can be done using Cuffcompare, R-SAP or any other similar program.

Step 2: Classification
Classify the assembled transcripts that overlap with the reference transcripts as 'annotated' or 'known', and the remaining transcripts as 'un-annotated' or 'unknown'. This overlap can vary from a single-nucleotide overlap (most lenient) to a complete exon-structure match (most stringent).

Step 3: Feature collection
For each assembled transcript, collect transcriptomic, genomic and expression features and create a table as shown below:

TxId      Tx len   # of exons   # of samples   Tx expression   Uniqueness   Outcome
TCONS1    ...      ...          ...            ...             ...          ...
TCONS2    ...      ...          ...            ...             ...          ...
TCONS3    ...      ...          ...            ...             ...          ...

Here:
TxId: TCONS1, TCONS2, TCONS3.... are the assembled transcript Ids.
Tx len: Length of the transcript (in base pairs)
# of exons:  Number of exons in the transcript
# of samples: Number of samples the transcript was detected in
Tx expression: Expression (RPKM or FPKM) of the transcript. Since the transcript is detected across multiple samples, one can take the 95th percentile of expression or maximum expression value
Uniqueness: Uniqueness score (as determined by the UCSC genome browser) for the genomic region that harbors the assembled transcript
Outcome: 'annotated' (represented by 1) or 'un-annotated' (represented by 0) as determined in 'Step 2'.

Note: Additional genomic features such as conservation, mappability, GC percent can also be included in the above tables. These measures can be found using the UCSC genome browser tracks.


Step 4: Classification and filtering
In this step the assembled transcripts are used as a training set to classify the 'un-annotated' transcripts. Classification (or, more precisely, clustering) is done based on the similarity of the transcript and genomic features (defined in the table from 'Step 3'). For this classification, an R package called 'rpart' (available here), for 'recursive partitioning and regression trees', is used. The table from 'Step 3' is imported into 'rpart', and a classification tree is produced as the output.

In the tree, each node (traversing from the root to the leaves of the tree) is labeled with a classification criterion. These criteria serve as the cutoff thresholds for calling a transcript genuinely expressed (if clustered with the 'annotated' transcripts) or background noise (if clustered with the 'un-annotated' transcripts). This classification (or clustering) in rpart is also accompanied by a probability score, which can also be used as a cutoff for defining a cluster as 'annotated' or 'unannotated' (see the rpart manual here).
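The paper uses the R package rpart for this step. Purely as an illustration of the same idea (and not the authors' code), here is a rough Python analogue using a decision tree from scikit-learn; the file name and column names are hypothetical:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# one row per assembled transcript, columns as in the 'Step 3' table (hypothetical names)
df = pd.read_csv("transcript_features.tsv", sep="\t")
features = ["tx_len", "n_exons", "n_samples", "tx_expression", "uniqueness"]

# fit a shallow tree on the annotated (1) vs un-annotated (0) outcome
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(df[features], df["outcome"])

# probability that each un-annotated transcript 'looks like' an annotated one;
# a probability cutoff then separates genuine transcripts from background noise
unknown = df[df["outcome"] == 0]
scores = tree.predict_proba(unknown[features])[:, 1]
genuine = unknown[scores >= 0.5]
print(len(genuine), "un-annotated transcripts retained as genuine")

The split criteria of the fitted tree play the same role as the cutoff thresholds described above.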


Note: As the authors suggest in the paper, the above filtering should be performed independently for each chromosome in order to account for the variability in gene density across chromosomes.







Sunday, April 27, 2014

How to filter out background noise from the transcriptome assembly? -- Part 2


Dear readers,
This post continues my previous blog post about filtering background (or noisy) transcripts (or fragments) from the set of transcripts assembled using RNA-Seq reads. The previous method was based on the estimated insert size of the paired-end library used for the sequencing.
The following method describes how further noise can be removed from the transcriptome assembly using the structure of the assembled transcripts and the structure of known-gene (reference) transcripts. Compare the reference genome alignments of the assembled transcripts with the reference annotations of known genes (such as RefSeq, Ensembl, UCSC, Gencode, etc.) and remove:

1.) Single-exon fragments that do not overlap any known (or reference) exons and lie completely within introns. Such fragments are typically generated by sequencing of unspliced, immature (precursor) RNA molecules.

2.) Single-exon transcripts mapping completely outside known gene boundaries. These short fragments generally result from the assembly of overlapping, mis-aligned RNA-Seq reads.
The rationale behind rules 1 and 2 is that bona fide transcripts would typically assemble into multi-exon transcripts (a small sketch implementing these two checks is given after this list).
Deep RNA-Seq may also reveal novel exons that are not yet included in the reference annotation. In that case, the expression value (RPKM / FPKM or another tag-count metric) of a novel exon should be comparable (within the same order of magnitude) to that of neighboring exons, so filtering can additionally use expression values (see Mortazavi et al. for more details).

3.) Transcripts mapping to genes known to have multiple copies in the genome, such as ribosomal genes and paralogous genes (including pseudogenes). These alignments typically arise from mis-aligned, and hence mis-assembled, sequencing reads.

4.) Transcripts mapping close to (or overlapping) unfinished (assembly-gap) or low-quality regions of the reference genome assembly. Coordinates of such genomic regions for the human genome can be downloaded from the UCSC genome database, which also provides uniqueness and mappability scores for the reference genome. These scores are derived from publicly available high-throughput genome and RNA-Seq data. Genomic regions with low scores are typically considered inaccessible to sequencing and do not produce high-quality alignments, so transcripts mapping to those regions can also be filtered out.

5.) Low-complexity and simple-repeat-rich genomic regions also produce low-quality, ambiguous alignments, so transcripts overlapping such regions (as defined in the UCSC genome browser) should also be removed. Some RNA-Seq studies, such as Lee et al., are specifically designed to study the expression of repetitive elements (such as transposable elements); in those cases additional caution should be taken.
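Here is a minimal Python sketch of the first two rules only (single-exon transcripts with no reference-exon overlap). The data structures and coordinates are placeholders; in practice you would typically compute the interval overlaps with a tool such as BEDTools:

def overlaps(s1, e1, s2, e2):
    # True if the two intervals [s1, e1] and [s2, e2] overlap
    return s1 <= e2 and s2 <= e1

def keep_transcript(tx, ref_exons):
    # rules 1 and 2: a single-exon transcript is kept only if it overlaps a known exon
    if tx["n_exons"] > 1:
        return True
    exons = ref_exons.get(tx["chrom"], [])
    return any(overlaps(tx["start"], tx["end"], s, e) for s, e in exons)

# placeholder inputs
transcripts = [
    {"chrom": "chr1", "start": 1000, "end": 1800, "n_exons": 1},   # intronic/intergenic fragment
    {"chrom": "chr1", "start": 4900, "end": 9000, "n_exons": 3},   # multi-exon transcript
]
ref_exons = {"chr1": [(4800, 5200), (8800, 9100)]}

filtered = [tx for tx in transcripts if keep_transcript(tx, ref_exons)]
print(len(filtered), "transcripts kept")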

In my following posts in this series I'll discuss some of the statistical methods to remove background noise based on the expression of the transcripts.




Sunday, February 9, 2014

Functional genomics and gene-symbols

Post-genome/transcriptome analysis, especially in cancer or other disease-related studies, usually ends in functional genomics: comparing the list of mutated or changed (qualitatively or quantitatively) genes with lists of genes from ontologies (GO, GeneGo), canonical pathways (DAVID, KEGG, WikiPathways), gene networks, or genes implicated in disease (COSMIC database).
This gene-list comparison is typically done by matching gene symbols (also known as gene identifiers). Gene-symbol comparison can get tricky, since gene symbols change due to either changes in the underlying nomenclature or re-annotation of gene function. For example, ARHGEF7 can be referred to as KIAA0142, PIXB, DKFZp761K1021, Nbla10314, DKFZp686C12170, BETA-PIX, COOL1, P85SPR, P85, P85COOL1, P50BP, PAK3 or P50.

This problem sounds trivial, but using incorrect or inconsistent gene symbols can result in data loss (changed genes being missed), which in turn can confound accurate biological interpretation of the data.

The solution to this problem is to always use the current standardized gene symbols approved and provided by the nomenclature committee, HGNC (HUGO Gene Nomenclature Committee).
A batch converter tool (http://www.genenames.org/cgi-bin/symbol_checker) converts older gene symbols and gene aliases to their current standard versions. Alternatively, gene symbols can also be converted to a consistent set using the DAVID conversion tool (available at: http://david.abcc.ncifcrf.gov/conversion.jsp).

Also, using Excel (yes, Microsoft Excel) to handle gene lists can change gene symbols. The supposedly "smart" Excel confuses some gene symbols with calendar dates and converts them to dates permanently. For example, MARCH1 becomes 1-March and SEPT2 becomes 2-September. I have gathered a list (available here) of 34 human gene symbols that are changed by Excel. This problem has also been described previously by Zeeberg et al.
Excel-mediated conversion of gene symbols can be avoided by selecting the column(s) that will contain gene symbols and formatting them as 'text' (right-click and set the format) before entering the data.

While comparing gene-symbol lists, there are a few other things to keep in mind. For example, make sure all the gene symbols are in the same case (C8orf4 vs. C8ORF4) and remove any spaces within or around the gene symbols.



Sunday, January 26, 2014

How to filter out background noise from the transcriptome assembly? -- Part 1


Dear readers,
Apologies for not being active on the blog for a while. I have been planning to write on this topic for a long time so here it is:
High-throughput transcriptome studies are typically performed using short-read (Illumina) RNA-Seq. Apart from expression estimation of genes/transcripts, RNA-Seq data are also used for discovering new RNAs, such as lincRNAs and novel splice variants of known genes. This usually requires transcriptome assembly using widely used pipelines such as Cufflinks or Scripture. Since these pipelines rely on the alignment of short reads to a reference genome, there is a possibility of mis-alignments and hence mis-assemblies. A typical human transcriptome assembly (using Cufflinks) from 80-100 million short paired-end reads will generate 200k-250k assembled fragments (also known as assembled transcripts), and not all of them represent true RNA molecules from the cell. So it becomes crucial to exclude potential mis-assemblies before performing any downstream analysis.
Now, there is no definitive way to distinguish a real RNA assembly from a mis-assembly without performing PCR (or a similar experiment) for each of them, which is, of course, not feasible. But over the past few years, experts in the field have proposed and implemented several computational methods to remove potential mis-assemblies, also referred to as "background" noise. I'll discuss them one by one in this post (and in following posts as well), focusing specifically on removing background noise from transcriptome assemblies (mainly reference-genome-guided assemblies).
Insert size based filtering:
Sequencing library preparation typically involves a "size-selection" step (following random fragmentation and cDNA amplification) that keeps fragments of only a particular predefined size (known as the insert size). The insert size for the Illumina paired-end protocol is typically 250-400 bp. The mean insert length and the associated distribution can also be estimated from the RNA-Seq data in question (see my previous blog post).
Ideally, if the RNA-Seq data have enough depth of coverage, there should not be any assembled fragment smaller than the insert size originally selected during the size-selection step. Short assembled fragments most likely represent assemblies arising from alignment errors or from shallowly sequenced genomic regions. Based on this, any assembled fragment shorter than the insert size can be discarded. To loosen this stringency, one can set the cutoff to one (or maybe two) standard deviations of the insert-size distribution below the mean insert size.
Here are references where the above cutoff has been used: Prensner et al. and Cabili et al. (see supplementary material).
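As a minimal sketch of this cutoff (the insert-size statistics and transcript lengths below are made-up numbers):

mean_insert = 300   # estimated mean insert size (bp) for this library
sd_insert = 30      # estimated standard deviation of the insert size

# relax the cutoff by one standard deviation below the mean insert size
min_len = mean_insert - 1 * sd_insert

# assembled transcript id -> assembled length in bp (placeholder values)
tx_lengths = {"TCONS_00000001": 250, "TCONS_00000002": 1450, "TCONS_00000003": 290}

kept = {tx: length for tx, length in tx_lengths.items() if length >= min_len}
print(kept)   # transcripts shorter than min_len are discarded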
I'll discuss more on the filtering methods in my next post.. so enjoy till then.


Tuesday, November 12, 2013

Difference between Cuffcompare and Cuffmerge



Cufflinks is one of the most commonly used programs for reference-genome-based transcriptome assembly, expression estimation and differential expression analysis. Cufflinks comes with two supplementary programs (in addition to a few others) for the post-Cufflinks workflow: Cuffcompare and Cuffmerge.

Although Cuffcompare and Cuffmerge may seem to perform the same task when it comes to handling multiple transcriptome assemblies, there are still substantial differences between the two. The Cufflinks manual, the Tuxedo pipeline (Bowtie-TopHat-Cufflinks-Cuffdiff) paper and forum posts from the developers have pointed out these differences, but I still get a lot of questions about them, so I will try to explain the differences more clearly.

Cuffcompare and Cuffmerge are both used to merge multiple transcript assemblies, but in slightly different ways.
Cuffcompare takes all the transcripts from multiple assemblies (in GTF format) and creates a union of all the transcripts in which redundant transcripts are removed. Cuffcompare does not change any assembled transcript in any assembly; instead, it simply compares the coordinates of the transcripts.
The resulting file, "combined.gtf", contains a set of "unified" transcripts across all the assemblies. The "combined.gtf" file can be used as the reference GTF file for quantification across samples using Cuffdiff (another program in the Cufflinks toolkit).

Cuffmerge, on the other hand, creates a "merged" set of transcripts from multiple assemblies. During this merging, transcripts from all the assemblies (GTF files) are converted to representative reads in SAM format, and Cufflinks (the original assembly program) is run internally to see whether there are gaps that can be filled and a longer consensus transcript can be created. Basically, Cuffmerge merges transcripts that overlap and share a similar exon (splicing) structure to generate a longer chain of connected exons.


Overall, Cuffcompare will generate a non-redundant set of transcripts, while Cuffmerge will generate a more consensus-like assembly from multiple assemblies. So from Cuffmerge you get a cleaner, somewhat more complete assembly and, generally, a smaller number of assembled transcripts compared to Cuffcompare.
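For reference, typical invocations look something like this (file names, paths and thread count are placeholders; check the Cufflinks manual for the options in your version):

cuffcompare -r annotation.gtf -o cuffcmp sample1/transcripts.gtf sample2/transcripts.gtf

cuffmerge -g annotation.gtf -s genome.fa -p 8 assembly_gtf_list.txt

Here assembly_gtf_list.txt is a plain-text file listing the path to one assembled GTF per line.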

Additional note:
Cuffcompare is a more comprehensive program than simply a tool to combine assemblies. For example, the ".tracking" file generated by Cuffcompare records how many samples each transcript was present in, giving you an idea of the multiplicity (recurrence) of each transcript across multiple samples. Cuffcompare can also annotate your transcript assemblies using a reference annotation file (in GTF format) and will assign a reference transcript id (such as an Ensembl id) and gene symbol to the assembled transcripts.



Wednesday, October 2, 2013

FPKM/RPKM normalization caveat and upper quartile normalization


FPKM (fragments per kilobase of exon per million mapped fragments) and RPKM (reads per kilobase of exon per million mapped reads) have been among the most widely used normalization methods for estimating transcript abundance from RNA-Seq tag counts (numbers of reads), where tag counts are normalized by transcript length (exons only) as well as by the total number of mappable reads in the sequencing library. However, recent studies have pointed out that this normalization method suffers from an inherent bias when very highly expressed transcripts are present.

For example, suppose there are two samples in a study, S1 and S2, and S1 has several highly abundant (highly expressed) transcripts. By the nature of the RNA-Seq protocol, highly expressed transcripts will generate large numbers of reads, and as a result S1 will have a larger total number of reads than S2.

Now suppose there is a transcript, let's say T, that is expressed at identical levels in both S1 and S2. If we compare the FPKM or RPKM of T between samples S1 and S2, T would appear down-regulated in S1, since S1 has a higher total read count and normalizing by it decreases the expression value of T by a large factor.

So normalization by total number of reads can sometimes lead to false detection of differentially expressed genes or transcripts in quantitative comparison studies based on RNA-Seq.

One of the easiest ways to remove such bias is upper-quartile normalization, which can be performed in a few simple steps:

1.) Create a matrix (in an Excel sheet or text file) where rows represent genes or transcripts and columns represent the different samples. Each cell contains the expression (read count or another expression metric) of a gene or transcript in one sample from the study.
2.) Remove those genes/transcripts that have 0 expression value in all the samples.

3.) Now for each column (i.e. for each sample):
     a. Sort the expression values in increasing order and find the 75th percentile value (upper quartile).
     b. Divide all the expression values by the upper quartile value

4.) This step is optional:
      Dividing the original expression values by a normalization factor results in very small values that may not be convenient for downstream calculations. So, to scale the normalized expression values back up, multiply all the values in the matrix by the mean of the upper-quartile values from all the samples.

I have summarized the above steps in the formula below:
Nnorm_ij = ( N_ij / D_i ) x ( ( D_1 + D_2 + ... + D_n ) / n )

Here:
j: jth gene or transcript (rows in the matrix)
i: ith sample, varying from 1 to n, where n is the total number of samples in the study (columns in the matrix)
Nnorm_ij: normalized and scaled-up expression value of the jth gene or transcript in the ith sample

N_ij: original expression (raw read count) of the jth gene or transcript in the ith sample

D_i: upper-quartile expression value for the ith sample
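A minimal Python sketch of these steps on a small made-up count matrix (rows = transcripts, columns = samples):

import numpy as np

# raw counts: rows = genes/transcripts, columns = samples (made-up numbers)
counts = np.array([
    [ 10, 200,  50],
    [  0,   0,   0],   # zero in every sample -> removed in step 2
    [120,  90, 300],
    [  5,  15,   8],
])

# step 2: drop rows that are zero across all samples
counts = counts[counts.sum(axis=1) > 0]

# step 3a: per-sample upper quartile (75th percentile)
upper_q = np.percentile(counts, 75, axis=0)

# step 3b + 4: divide each column by its upper quartile, then rescale by the mean upper quartile
normalized = counts / upper_q * upper_q.mean()
print(normalized)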


Various normalization methods available for the RNA-Seq data are discussed in detail in 
http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long

If you are using Cuffdiff  to estimate FPKM/RPKM then you can also perform upper quartile normalization internally by using the argument "--upper-quartile-norm".

Sunday, September 29, 2013

How much memory does TopHat require?


I've been using the TopHat program to align RNA-Seq reads to the reference genome. I mainly run my jobs on computing clusters, where the estimated resource usage (such as memory and CPU time) has to be specified before submitting the jobs.
In the beginning, most of my TopHat jobs would fail due to exceeding the requested memory (RAM), especially when there were hundreds of millions of reads in the fastq files. Looking at the TopHat run logs, it was clear that TopHat uses extensive memory in the "joining segment hits" step, where split alignments of a read are 'stitched' together to generate a contiguous alignment. To determine the memory usage, I ran TopHat multiple times on a sample RNA-Seq dataset (human transcriptome) with a varying number of reads in each run. Following are the details of the TopHat runs:

Reference genome: hg19 (GRCh37)
Read length: 50 (paired, insert length: 250, standard-deviation: 30)
Reference transcriptome: Ensembl annotations (from UCSC)
Number of threads (cpus) assigned: 12
Fusion search: On (to get the maximum memory that Tophat may use)

The rest of the TopHat parameters were kept at their default values.

I plotted the amount of memory used against the number of reads in the dataset (figure below).



As expected, memory usage increases (almost linearly) with the number of reads in the data. TopHat needs at least 4 GB of RAM to run your data against the human genome, since it loads the entire genome index into RAM before starting the alignment. For large datasets, such as 200 million reads, memory usage can easily exceed 30 GB (so not appropriate for desktop machines).
Hope it helps.





Friday, April 5, 2013

How to parse compressed text files using Perl

Large data files, such as FASTQ files from sequencing runs, can reach gigabytes in size and can fill up a system's disk easily and quickly, so they are better kept compressed (*.gz or *.zip format) to save disk space. Compressing typically saves about 4/5 of the original (uncompressed) file size.
Some current bioinformatics tools (such as the Picard tools) accept input files in compressed format, but a problem arises when files need to be parsed using custom scripts (such as Perl scripts). One way is to uncompress the file, parse it and re-compress it, which may take a significant amount of computational time if the data files are large (GBs to TBs in size). Here is a simple way to parse compressed files without uncompressing them:

#!/usr/bin/perl
use strict;
use warnings;

open FH, "gunzip -c <file_name> | ";

        while(<FH>){  ## read single line from the file
        ## parse the string ##
       }

close FH
If you have *.zip files, you can replace the open command above with:

open(FH, "unzip -p <file_name> |") or die "Cannot open pipe to unzip: $!";


The only downside of this method is that you cannot seek around in the file as you could with a regular filehandle; you can only parse the file sequentially this way.

Monday, April 1, 2013

Picard "no space left on device" error

Sometimes Picard's tools terminate with the error message  "no space left on device" even though there is plenty of space available on the disk.

In addition Picard will throw some java exception error messages such as:
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Write error;
.
.
Caused by: java.io.IOException: Input/output error
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at net.sf.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:197)
.
.
etc.


By default, Picard tries to use the system's temporary directory, which has a limited amount of space allocated to it by the system. If you are working with very large data files, the default temporary directory fills up quickly and Picard stops working.
To avoid this error, create a temporary directory somewhere on your system and direct Picard to use it as the temporary directory by specifying TMP_DIR=/some_path/your_temporary_directory/ on the command line.
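For example, with any Picard tool (the tool name, input files and temporary directory path here are just placeholders, and the exact invocation depends on your Picard version):

java -jar picard.jar SortSam I=input.bam O=sorted.bam SORT_ORDER=coordinate TMP_DIR=/scratch/username/picard_tmp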

Sunday, March 31, 2013

Why are short-read alignments not always reproducible?


Some of you may have noticed that if you re-run a short-read aligner (such as BWA or SOAP) on the same data multiple times, the reported alignments are not identical. Short-read alignment is always challenging due to reference genome complexity and the uncertainty of mapping relatively short reads in repetitive regions. When the aligner finds a read mapping to multiple genomic loci with equal alignment scores, it assigns the read randomly to a single locus. Since this locus is chosen randomly by the aligner, alignments from multiple runs may not be identical.
To overcome this problem, bowtie (as well as bowtie2) follows a "pseudo-random" selection procedure: the seed for the randomization is generated from the read name (and other attributes from the FASTQ record) for each read. This guarantees exactly the same output every time bowtie is run. The user may also specify a seed for the randomization, and using the same seed will produce identical alignments from each bowtie run on the same data.

Monday, February 4, 2013

How to improve RNA-Seq alignment using TopHat

Hello readers,
TopHat is a program for aligning RNA-Seq reads to the reference genome. It is a spliced-read aligner that can align reads spanning canonical splice sites. TopHat's performance and alignment rate (% of reads mapped) depend on data quality as well as alignment stringency. Once a TopHat run is finished, the alignment rate can be further improved by doing a junction-specific TopHat alignment; the accepted_hits.bam files from this two-step alignment can then be merged to generate an improved (hopefully) alignment BAM file. This idea was originally proposed in Cabili et al.

The TopHat output contains a file called "junctions.bed" that lists all the splice sites detected during the alignment. If your study contains multiple samples, the "junctions.bed" files from each sample's TopHat run can be combined (pooled) and used as a bigger junction-site database for the alignment run. The TopHat output also includes the unmapped reads. To improve the overall alignment, these unmapped reads are re-aligned to the reference genome using the merged junction database.

For simplicity, I am dividing the whole process into four steps:

1.) Creating the pooled junction database:
You can do something like:

cat */junctions.bed > junctions.merged.bed

In order to run TopHat, the junctions.merged.bed file should be in a specific format called the ".junc" format. This conversion can easily be done using another program, "bed_to_juncs", present in the TopHat program directory. The junction file also needs to be sorted, and duplicates must be removed.
You can run the following command to do all of the above steps (BED-to-junction conversion, duplicate removal and sorting):

bed_to_juncs < junctions.merged.bed | sort -k 1,4 -u | sort -k 1,1 > junctions.merged.junc

Running this command will give you some warnings similar to:
        Warning: malformed line 1, missing columns
        track name=junctions description="TopHat junctions"
but the output file should be fine. These warnings are caused by the presence of header lines in junctions.merged.bed.


2.) Extracting the unmapped reads (or mates):
Unmapped reads from the first TopHat run should be in the TopHat output directory (tophat_out by default), either as unmapped.bam (for TopHat2) or as unmapped_left.fq.gz and unmapped_right.fq.gz (for TopHat 1).

unmapped.bam can be converted to FASTQ or FASTA format using the bamToFastq tool or any other similar utility of your choice.
unmapped_left.fq.gz and unmapped_right.fq.gz are compressed fastq files that can be decompressed using the gunzip command.

3.) Mapping:
Mapping of the unmapped reads from the first alignment can be done by running TopHat again and specifying the unmapped read files (e.g. unmapped_left.fq and unmapped_right.fq, or unmapped_1.fastq and unmapped_2.fastq) as the input files.
You also need to specify the junctions.merged.junc file by including the following in the TopHat command line:

       -j /some_path/junctions.merged.junc --no-novel-juncs
You can keep the rest of the TopHat parameters the same as in the previous run.

One more thing to keep in mind: for the second TopHat run, make another output directory, otherwise the second run will overwrite the output files of the original run. For convenience, you can create the second run's output directory within the original TopHat output directory.
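Putting steps 2 and 3 together, the second run looks roughly like this (the index path, thread count, output directory and read file names are placeholders):

tophat -o tophat_out/second_run -p 8 -j /some_path/junctions.merged.junc --no-novel-juncs /path/to/bowtie_index/hg19 unmapped_1.fastq unmapped_2.fastq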

4.) Merging:
After step 3, you will have two "accepted_hits.bam" files for each sample. These files can be merged using:
 samtools merge merged.bam /path_to_tophat_original_run_out/accepted_hits.bam /path_to_tophat_second_run_out/accepted_hits.bam


"merged.bam" is the final alignment file that you can now use for the downstream analysis.


I have observed a 3-10% improvement in alignment rate by following the above procedure on my datasets. This improvement will vary with the data.
Also, junctions.merged.bed can include junctions from other sources, such as reference transcript annotations. The bigger the junction database, the better the alignment will be. But make sure to keep the reference genome sequence consistent.

Note: unmapped_left.fq and unmapped_right.fq may not be properly paired; they may have unequal numbers of reads. You don't need to worry about that when following the above procedure, since TopHat will take care of it while processing the input files. If you think mate-pair inconsistency is causing problems during the alignment, you can read my previous blog post on "how-to-fix-order-of-paired-end-reads" to fix this inconsistency.