Sunday, September 29, 2013

How much memory does TopHat require?

I've been using TopHat to align RNA-Seq reads to a reference genome. I mainly run my jobs on computing clusters, where the estimated resource usage (such as memory and CPU time) has to be specified before a job is submitted.
In the beginning, most of my TopHat jobs failed because they exceeded the requested memory (RAM), especially when the FASTQ files contained hundreds of millions of reads. After looking into the TopHat run logs, it was clear that TopHat uses a large amount of memory in the "joining segment hits" step, where the split alignments of a read are 'stitched' together to generate a contiguous alignment. To measure the memory usage, I ran TopHat multiple times on a sample RNA-Seq dataset (from the human transcriptome), varying the number of reads in each run. The details of the TopHat runs were:

Reference genome: hg19 (GRCh37)
Read length: 50 (paired, insert length: 250, standard-deviation: 30)
Reference transcriptome: Ensembl annotations (from UCSC)
Number of threads (cpus) assigned: 12
Fusion search: On (to get the maximum memory that Tophat may use)

The rest of the TopHat parameters were left at their default values.
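
For readers who want to reproduce a similar run, here is a rough sketch of how the settings above might map onto a TopHat command line. It is not the exact command I ran. The file names and the index prefix are placeholders, and the function name is made up for illustration. I am also assuming that "insert length" means the full fragment length, so the inner distance passed to -r is the fragment length minus both reads; if your 250 is already the inner distance, pass it directly. Depending on your TopHat version, fusion search may need extra options that are not shown here.

```python
import shlex


def build_tophat_command(index_prefix, reads_1, reads_2, gtf_path,
                         threads=12, fragment_length=250, read_length=50,
                         mate_std_dev=30, output_dir="tophat_out"):
    """Return a TopHat argument list mirroring the settings listed above."""
    # Assumption: "insert length" is the whole fragment, so the inner
    # distance between mates is the fragment minus the two reads.
    mate_inner_dist = fragment_length - 2 * read_length
    return [
        "tophat",
        "-p", str(threads),                   # number of threads
        "-r", str(mate_inner_dist),           # expected mate inner distance
        "--mate-std-dev", str(mate_std_dev),  # std. dev. of inner distance
        "-G", gtf_path,                       # reference transcriptome (GTF)
        "--fusion-search",                    # fusion search on
        "-o", output_dir,
        index_prefix, reads_1, reads_2,
    ]


# Placeholder paths; substitute your own index prefix and FASTQ files.
command = build_tophat_command(
    "hg19", "sample_1.fastq", "sample_2.fastq", "ensembl_genes.gtf"
)
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the command as a list makes it easy to hand to a job script or to subprocess without worrying about shell quoting.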

I plotted the amount of memory used against the number of reads in the dataset (figure below).

As expected, memory usage increases almost linearly with the number of reads in the data. TopHat needs at least 4 GB of RAM to run your data against the human genome, since it loads the entire genome index into RAM before starting the alignment. For large datasets, such as 200 million reads, memory usage can easily exceed 30 GB (so it is not suitable for desktop machines).
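
If you need a number to put in a cluster job request, a straight line through the two figures quoted here (about 4 GB with almost no reads, about 30 GB at 200 million reads) gives a first guess. This is only a back-of-the-envelope sketch, not a fit to the measured data: the function name and the safety margin are my own assumptions, 30 GB is a lower bound rather than an exact value, and the numbers apply to the settings described above (hg19, 12 threads, fusion search on).

```python
# Two reference points taken from the text (both approximate).
BASELINE_GB = 4.0            # genome index loaded before alignment starts
REF_READS_MILLIONS = 200.0   # dataset size of the large example
REF_MEMORY_GB = 30.0         # memory observed at that size (a lower bound)


def estimate_memory_gb(reads_millions, safety_factor=1.2):
    """Linearly interpolate memory use, then pad it by a safety factor."""
    if reads_millions < 0:
        raise ValueError("reads_millions must be non-negative")
    gb_per_million_reads = (REF_MEMORY_GB - BASELINE_GB) / REF_READS_MILLIONS
    estimate = BASELINE_GB + gb_per_million_reads * reads_millions
    # The safety factor is an arbitrary margin, not something measured.
    return estimate * safety_factor


for reads in (50, 100, 200):
    print(f"{reads}M reads: request about {estimate_memory_gb(reads):.0f} GB")
```

Treat the result as a starting point and adjust it after checking the actual usage of your first few jobs.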
Hope it helps.
