Tuesday, November 12, 2013

Difference between Cuffcompare and Cuffmerge



Cufflinks is one of the most commonly used programs for reference-genome-based transcriptome assembly, expression estimation, and differential expression analysis. It comes with two supplementary programs (among a few others) for the post-Cufflinks workflow: Cuffcompare and Cuffmerge.

Although Cuffcompare and Cuffmerge may seem to perform the same task when it comes to handling multiple transcriptome assemblies, there are still substantial differences between the two. The Cufflinks manual, the Tuxedo pipeline (Bowtie-TopHat-Cufflinks-Cuffdiff) paper, and forum posts from the developers have pointed out these differences, but I still get a lot of questions about them. I will try to explain the differences to make them clearer.

Cuffcompare and Cuffmerge are both used to combine multiple transcript assemblies, but in slightly different ways.
Cuffcompare takes all the transcripts from multiple assemblies (in GTF format) and creates a union of them in which redundant transcripts are removed. Cuffcompare does not change any assembled transcript in any of the assemblies; it simply compares the coordinates of the transcripts.
The resulting file, "combined.gtf" (written with the output prefix, for example "cuffcmp.combined.gtf" by default), contains a set of "unified" transcripts across all the assemblies. This file can be used as the reference GTF file for quantification across the samples using Cuffdiff (another program in the Cufflinks toolkit).
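
To make the "union without redundancy" idea concrete, here is a small Python sketch. It is only a toy illustration, not Cuffcompare's actual algorithm: I assume that two multi-exon transcripts are redundant when they sit on the same chromosome and strand and share the same intron chain, and the Transcript type, the function names, and the coordinates are all made up for the example. Note that no transcript is modified; duplicates are simply skipped.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    # Hypothetical minimal transcript record used only for this illustration.
    name: str
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs sorted by start


def structure_key(t):
    """Key that is identical for transcripts with the same splicing structure."""
    introns = tuple((a[1], b[0]) for a, b in zip(t.exons, t.exons[1:]))
    if introns:
        return (t.chrom, t.strand, "introns", introns)
    # Single-exon transcripts have no introns: fall back to exon coordinates.
    return (t.chrom, t.strand, "exon", t.exons)


def union_of_assemblies(assemblies):
    """Keep the first transcript seen for each structure; change nothing else."""
    combined = {}
    for assembly in assemblies:
        for t in assembly:
            combined.setdefault(structure_key(t), t)
    return list(combined.values())


# Made-up assemblies from two samples.
sample1 = [
    Transcript("s1.1", "chr1", "+", ((100, 200), (300, 400))),
    Transcript("s1.2", "chr1", "+", ((1000, 1200),)),
]
sample2 = [
    # Same intron chain as s1.1 (only the outer ends differ), so it is redundant.
    Transcript("s2.1", "chr1", "+", ((120, 200), (300, 450))),
    Transcript("s2.2", "chr1", "-", ((5000, 5100), (5200, 5300))),
]

combined = union_of_assemblies([sample1, sample2])
print([t.name for t in combined])
```

The surviving transcripts keep their original coordinates, which mirrors the point above: this style of combination compares structures but does not rebuild them.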

Cuffmerge, on the other hand, creates a "merged" set of transcripts from multiple assemblies. During this merging, transcripts from all the assemblies (GTF files) are converted to representative reads in SAM format, and Cufflinks (the original assembly program) is run internally to see if there are any gaps that can be filled so that a longer consensus transcript can be created. Basically, Cuffmerge merges transcripts that overlap and share a similar exon structure (or splicing structure) to generate a longer chain of connected exons.
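
The idea of chaining overlapping, structurally compatible transcripts into a longer one can be sketched in a few lines of Python. This is a deliberately simplified toy, not what Cuffmerge really does (as described above, it re-runs Cufflinks on the converted transcripts). My assumption here is that two transcripts on the same strand are compatible when they overlap and agree on every intron inside the overlapping region; the function names and coordinates are hypothetical.

```python
def introns(exons):
    """Introns implied by a sorted list of (start, end) exons."""
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def compatible(a, b):
    """True if a and b overlap and agree on all introns within the overlap."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    if lo > hi:
        return False  # the two transcripts do not overlap at all

    def inside(exons):
        # Introns that intersect the region shared by both transcripts.
        return {(s, e) for s, e in introns(exons) if s < hi and e > lo}

    return inside(a) == inside(b)


def merge(a, b):
    """Union of the exons of two compatible transcripts."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            # Overlapping exons collapse into one longer exon.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up transcripts: tx_a covers the 5' part, tx_b the 3' part of a gene.
tx_a = [(100, 200), (300, 400), (500, 600)]
tx_b = [(320, 400), (500, 650), (700, 800)]
# tx_c spans an intron of tx_a with exonic sequence, so it is not compatible.
tx_c = [(150, 350)]

if compatible(tx_a, tx_b):
    longer = merge(tx_a, tx_b)
    print(longer)
print(compatible(tx_a, tx_c))
```

Here two partial transcripts become a single four-exon transcript, which is why a merged assembly tends to contain fewer but longer transcripts than a simple union.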


Overall, Cuffcompare generates a non-redundant set of transcripts, while Cuffmerge generates a consensus assembly from multiple assemblies. So from Cuffmerge you get a cleaner, somewhat more complete assembly and, generally, fewer assembled transcripts than you get from Cuffcompare.

Additional note:
Cuffcompare is a more comprehensive program than simply a tool to combine assemblies. For example, the ".tracking" file generated by Cuffcompare records which samples each transcript was present in, so that you can get an idea of the multiplicity (recurrence) of each transcript across multiple samples. Cuffcompare can also annotate your transcript assemblies using a reference annotation file (in GTF format), assigning the reference transcript ID (such as an Ensembl ID) and gene symbol to the assembled transcripts.
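
As a rough sketch of how the multiplicity could be pulled out of a ".tracking" file, the Python snippet below counts the samples in which each transcript appears. I am assuming the tab-delimited layout described in the Cufflinks manual: transcript ID, locus ID, reference gene/transcript, class code, and then one column per input sample that holds "-" when the transcript is absent from that sample. The two rows in the example are invented, and sample_counts is just a name I made up.

```python
def sample_counts(tracking_text):
    """Map each transcript ID to the number of samples it was found in."""
    counts = {}
    for line in tracking_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        transcript_id = fields[0]
        per_sample = fields[4:]  # assumed: one column per input assembly
        counts[transcript_id] = sum(1 for cell in per_sample if cell != "-")
    return counts


# Invented example rows for two samples (q1 and q2).
example = (
    "TCONS_00000001\tXLOC_000001\t-\tu\t"
    "q1:CUFF.1|CUFF.1.1|100|10.0|8.0|12.0|20.0|1500\t"
    "q2:CUFF.3|CUFF.3.1|100|9.0|7.0|11.0|18.0|1480\n"
    "TCONS_00000002\tXLOC_000002\tGeneA|TxA\t=\t"
    "q1:CUFF.2|CUFF.2.1|100|5.0|4.0|6.0|10.0|900\t-\n"
)

counts = sample_counts(example)
print(counts)
```

With a real file you would read its contents and pass them to the function; transcripts with a count equal to the number of samples are the ones recovered in every assembly.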