Bioinformatics blog: How to estimate insert size for paired-end RNA-Seq data

Tuesday, February 28, 2012

How to estimate insert size for paired-end RNA-Seq data

Reference genome alignment high-throughput RNA-seq reads generated from "paired-end" library often requires "mate-pair insert length" and associated standard deviation for alignment tools such as BowTie, TopHat or BWA to work properly. These values can be obtained from the library prep that was used for the sequencing. But these values are not always accompanied with the data in public databases hence need to be estimated computationally. Here are some simple steps:

Required sowftare: BowTie or BWA, Picard tools.
Reference transcriptome: Reference transcript sequences from the same species as the RNA-Seq data is from. Instead of reference transcript, reference genome can also be used. I prefer reference transcriptome since it saves a magnitude of alignment as well as downstream processing time over the whole genome and also generates more alignments as splice junction reads are easily aligned to the transcript sequence than the reference genome sequence.

Steps:
1.)Create reference transcriptome index for BowTie (or BWA).
2.)Align paired-end RNA-Seq data to the reference transcripts using minimum insert-length as zero and maximum insert-length as 500 (or more).
BowTie: -I 0 -X 500, BWA: -a 500. Set output format as SAM(BowTie) or BAM(BWA).
3.) Sort resulting SAM/BAM file using Picard's SortSam.jar utility as follows:
java -Xmx2g -jar picard-tools/SortSam.jar INPUT= OUTPUT= SORT_ORDER=coordinate
4.) Now run Picard's CollectInsertSizeMetrics.jar utility as follows:
java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT= HISTOGRAM_FILE= OUTPUT=
5.)Insert-size distribution is printed in output_file while corresponding histogram image file is generated as pdf in histogram_file.
6.)Insert-matrix output_file conain estimated insert-size and corresponding standard deviation. Picard generates two types estimates: "MEDIAN_INSERT_SIZE" and "MEAN_INSERT_SIZE". Anyoneone of them can be used but I prefer the second one since it is estmated after trimming off the outliers in from the original insert-size distribution in order to render normal distribution.

Note: If you plan to run TopHat on the same data aftrwards, set -r as (estimated insert-size-2*read_length) since TopHat requires distance between mate-pairs and not the insert-size.

28 comments:

Konstantin OkonechnikovApril 10, 2012 at 1:00 PM
You can also use Qualimap tool (http://qualimap.bioinfo.cipf.es/) to calculate the insert size histogram and mean insert size.
ReplyDelete
Replies
AnonymousMay 18, 2013 at 1:23 AM
Hi, thanks for the post. Do you know how important is the -r parameter for TopHat? They do not mention anything about that in the TopHat-Cufflinks protocole paper in Nature Protocols. Thank you
ReplyDelete
Replies
SandraMay 18, 2013 at 1:24 AM
Hi, thanks for the post. Do you know how important is the -r parameter for TopHat? They do not mention anything about that in the TopHat-Cufflinks protocole paper in Nature Protocols. Thank you
ReplyDelete
Replies
SandraMay 18, 2013 at 1:25 AM
Hi, thanks for the post. Do you know how important is the -r parameter for TopHat? They do not mention anything about that in the TopHat-Cufflinks protocole paper in Nature Protocols. Thank you
ReplyDelete
Replies
Vinay MittalMay 18, 2013 at 2:18 AM
Hi Sandra,
-r helps Tophat in the paired-end alignment. They do not mention it in their Nature paper but it is explained in the manual and discussed in Tophat's FAQs.
ReplyDelete
Replies
SandraMay 18, 2013 at 11:59 PM
thanks
ReplyDelete
Replies
athinodorosJune 3, 2013 at 4:26 AM
nice info on how to get the insert size. My question is what is the way to evaluate the mapping results with and without using the insert size information? What are the basic statistics one takes into account after mapping? any suggested tools?
ReplyDelete
Replies
athinodorosJune 3, 2013 at 5:35 AM
One more question; Distance between mate pairs is different than the inner size?
ReplyDelete
Replies
Vinay MittalJune 7, 2013 at 3:31 PM
@athinodoros:
If you are trying to map paired-end reads to the reference genome or transcriptome, you will have to specify the insert-size to the aligner. You can let the aligner use its default, if you prefer.
Post alignment evaluation is done by calculating the %of the initial number of reads aligned to the reference genome/transcriptome.
In case of RNA-Seq data, you may also want to check for the number(or %) of the spliced-reads. I have explained that here: http://vinaykmittal.blogspot.com/2012/09/how-to-count-number-of-splice-junction.html

Distance between the mate-pairs (form right most end of the left mate to the left most end of the right mate) is the inner distance of mate pairs.
ReplyDelete
Replies
Saad KhanAugust 23, 2013 at 1:34 PM
If we already know the fragment size information for a particular SRA file. What is the way of getting the correct -r option to use without necessary doing the aligning and other steps.

For example : my fragment size is
200 - 250 and the read length is 101
so can I choose my insert size or -r option to be 150?

thanks
Saad
ReplyDelete
Replies
Saad KhanAugust 23, 2013 at 1:35 PM
If we already know the fragment size information for a particular SRA file. What is the way of getting the correct -r option to use without necessary doing the aligning and other steps.

For example : my fragment size is
200 - 250 and the read length is 101
so can I choose my insert size or -r option to be 150?

thanks
Saad
ReplyDelete
Replies
Vinay MittalAugust 23, 2013 at 3:01 PM
Hi Saad,
I believe by "-r" you mean tophat's argument for the inner distance. You can infer "r" using the information from SRA metadata. In your case "r" would be 250 - 2*1001 = 48.

Fragment length (or the insert length) given by SRA is the theoretical one which is derived before the sequencing and during the library preparation. This can change during the sequencing. You can still use the theoretical values but if you want to be more accurate than it is better to estimate the insert length using the read data.
ReplyDelete
Replies
Saad KhanAugust 27, 2013 at 5:59 PM
Following is the result that I am getting when I use this transcriptome :

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/mrna.fa.gz

MEDIAN_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE STANDARD_DEVIATION READ_PAIRS PAIR_ORIENTATION WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_99_PERCENT SAMPLE LIBRARY READ_GROUP
267 17 20 4651 264.352882 31.269978 8672422 FR 7 13 19 27 35 43 51 63 81 319

The median(267) and/or mean (264.35) insert size is vastly different from elucidating the value from SRA in the above post (which is 48 btw). Do you think I might be doing something wrong here, I have exactly followed comments in your blog post.

Let me know which value seems more reasonable to you 48 or median/mean insert size detected by picard.
ReplyDelete
Replies
Vinay MittalAugust 28, 2013 at 1:24 PM
Hi Saad,
There are two things going on here: first is "r" which is the inner distance (distance between the closest ends of the reads) and second is the "insert size" which is the inner distance + 2*read length.

Your insert size (reported by Picard) is 267 or 264. But for tophat, you need "-r" i.e. inner distance (instead of the insert size) which is insert-length - 2*readLength.
ReplyDelete
Replies
UnknownSeptember 30, 2013 at 1:27 AM
Hi vinay am not able to be clear on calculating -r

MEDIAN_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE STANDARD_DEVIATION READ_PAIRS PAIR_ORIENTATION WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_99_PERCENT SAMPLE LIBRARY READ_GROUP
159 29 77 500 175.352682 61.084072 6486431 FR 13 23 35 47 59 73 89 117 211 541
My Read was prepared for the 76bp PE run. The average fragment size of the library is 300bp.

so my r should be 175 - 2 (76) = 23

Is it correct?

ReplyDelete
Replies
Vinay MittalSeptember 30, 2013 at 1:31 AM
Yes...that's correct.
ReplyDelete
Replies
AnonymousJanuary 23, 2014 at 4:09 PM
Hi-
Just wanted to check if you are still active on this blog as I have a general Picard question.
ReplyDelete
Replies
Vinay MittalJanuary 23, 2014 at 4:10 PM
Yes.... the blog is still active.
ReplyDelete
Replies
UnknownMarch 12, 2014 at 3:46 PM
Hi Vinay, Your blog is really useful, thank you for your efforts.

I have a question about Insert size by CollectInsertSizeMetrics by Picard, and the estimated Insert size by Cufflinks. In my case, the actual insert size = 320 bp, read length = 100 bp (PE run), inner distance = 120 bp.

Now for a particular sample,
by Picard, Insert size Mean = 416.33 and Median = 373
by Cufflinks, Est. Mean = 252.20, Est. Std. dev. = 75.99

I was concerned that the Insert size doesn't match across the two software's. Do you have any comment on this as to why they differ?

Any comment highly appreciated.

Thanks.
ReplyDelete
Replies
UnknownMarch 12, 2014 at 3:46 PM
Hi Vinay, Your blog is really useful, thank you for your efforts.

I have a question about Insert size by CollectInsertSizeMetrics by Picard, and the estimated Insert size by Cufflinks. In my case, the actual insert size = 320 bp, read length = 100 bp (PE run), inner distance = 120 bp.

Now for a particular sample,
by Picard, Insert size Mean = 416.33 and Median = 373
by Cufflinks, Est. Mean = 252.20, Est. Std. dev. = 75.99

I was concerned that the Insert size doesn't match across the two software's. Do you have any comment on this as to why they differ?

Any comment highly appreciated.

Thanks.
ReplyDelete
Replies
Vinay MittalMarch 12, 2014 at 3:54 PM
Hi Shweta,
I am not 100% sure what's happening here but it looks like that if you add your read length to the cufflinks estimates, it almost adds up to your Picard estimates.
There is some discussion on this here (not very conclusive though):
http://seqanswers.com/forums/archive/index.php/t-7383.html

So I believe, cufflinks is reporting the inner distance instead of the original fragment-size (or insert -size). Since, cufflinks is estimating this internally, I wouldn't worry about the difference.
ReplyDelete
Replies
UnknownMarch 12, 2014 at 4:00 PM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownMarch 30, 2014 at 7:11 PM
Vinay I am new to the bowtie2 samtools picard tophat RNA analysis process. Would you be able to answer a few of my questions regarding what I am doing exactly? I am currently an undergraduate student trying to learn
ReplyDelete
Replies
UnknownMarch 30, 2014 at 7:13 PM
Hi Vinay, I am new to RNA analysis using bowtie2, samtools, picard and tophat2. Would you be able to answer a few questions for me so that I can understand what exactly I am doing? Currently I am an undergraduate student trying to learn
ReplyDelete
Replies
UnknownMarch 30, 2014 at 7:14 PM
anmdave@gmail.com forgot to leave this for your response! Thanks Vinay.
ReplyDelete
Replies
yousraApril 18, 2014 at 8:32 AM
Hi Shweta Chavan,
I have a question about Map Splice? Actually, I am interested in backsplices. So I use Map Splice with the option --fusion-non-canonical. I got this ouput which is not very clear and I have check their manual and it is not useful. Anybody have an idea about the meaning of these informations. In their manual, they describe only 20 columns althghou it seems that there is over 30 columns in their ouput. Please Help! Specially the start and the end coordinations: They are very strange: it's over 300000 kb which is totally wrong in yeast.
chr13~chr03 378626 106516 FUSIONJUNC_103 5 ++ 255,0,0 2 25,36,137,178, 0,272136, 1.332179 6 GTAG 1 1 1.000000 25 0 1 5 04 1 3 0 4 1 2 378601 106552 106516,36M84P57M| 378491,62M49P25M| 0 0 0.591111 0.84 93 93 87 87 1.33218 0.0196078 1 93 87 166 112 124.2 1 0 0 1
chr10~chr07 453908 883299 FUSIONJUNC_336 6 -- 255,0,0 2 32,36,346,156, 0,18446744073709122258, 1.560710 6 GTAG 0 1 0.500000 23 19 5 60 3 1 2 0 3 3 12 453940 883263 883004,296M| 453908,86M114P145M| 0 0 0.199365 0.244444 296 296 231 231 2.19722 0.00980392 1 296 243 361 170 299.222 6 0 3 3
chr10~chr07 454012 883299 FUSIONJUNC_337 3 -- 255,0,0 2 23,31,314,31, 0,18446744073709122353, 0.636514 6 GTAG 0 1 0.666667 23 0 5 30 3 0 3 0 3 0 8 454035 883268 883109,191M| 454012,23M73P122M|454012,23M239P51M| 0 0 0.275556 0.36 191 191 145 74 1.74787 0.0130719 1 191 145 286 102 194.429 4 0 3 1
chr07~chr10 883300 338619 FUSIONJUNC_338 2 -+ 255,0,0 2 35,34,164,124, 0,18446744073709006970, 0.693147 5 GTAG 0 2 1.000000 17 16 17 20 2 1 1 0 2 0 12 883335 338653 883300,75M37P108M| 338619,225M| 0 0 0.296111 0.478333 183 183 225 225 1.90615 0.0196078 1 220 225 395 140 261.25 6 0 2 4
chr12~chr12 468872 455123 FUSIONJUNC_522 3 +- 255,0,0 2 19,46,13800,109, 0,13769, 0.636514 6 GTAG 0 1 0.666667 19 0 13 30 3 1 2 0 3 0 46 468853 455077 454830,294M| 455074,57M13689P53M| 0 0 0.319365 0.737778 294 294 110 110 3.04482 0.0130719 1 294 145 370 122 208.962 0 23 0 23
chr12~chr12 468921 456520 FUSIONJUNC_524 7 +- 255,0,0 2 28,43,12388,143, 0,12430, 1.277034 6 GTAG 0 1 0.285714 10 23 5 70 7 5 2 0 7 0 44 468893 456477 456233,288M| 456535,51M12234P102M| 0 0 0.298291 0.584215 288 288 153 153 3.01463 0.00560224 1 288 154 372 118 208.69 0 22 0 22
chr12~chrmt 459784 59580 FUSIONJUNC_576 9 ++ 255,0,0 2 41,41,103,144, 0,18446744073709151454, 2.043192 5 GTAG 0 1 0.333333 25 25 1 90 9 5 4 0 9 0 50 459743 59621 459683,102M| 59580,305M| 0 0 0.655686 0.252857 102 102 305 305 2.84117 0.00653595 1 102 305 396 108 204.941 0 25 25 0
chr12~chrmt 489457 59580 FUSIONJUNC_578 2 ++ 255,0,0 2 27,24,3726,41, 0,18446744073709121764, 0.000000 5 GTAG 1 1 1.000000 0 24 3 20 2 2 0 0 2 0 100 489430 59604 485704,109M3543P102M| 59580,305M| 0 0 0.288889 0.195556 211 211 305 305 3.23797 0.0196078 1 213 305 396 129 267.135 0 50 50 0
c
ReplyDelete
Replies
yousraApril 18, 2014 at 8:32 AM
Hi Shweta Chavan,
I have a question about Map Splice? Actually, I am interested in backsplices. So I use Map Splice with the option --fusion-non-canonical. I got this ouput which is not very clear and I have check their manual and it is not useful. Anybody have an idea about the meaning of these informations. In their manual, they describe only 20 columns althghou it seems that there is over 30 columns in their ouput. Please Help! Specially the start and the end coordinations: They are very strange: it's over 300000 kb which is totally wrong in yeast.
chr13~chr03 378626 106516 FUSIONJUNC_103 5 ++ 255,0,0 2 25,36,137,178, 0,272136, 1.332179 6 GTAG 1 1 1.000000 25 0 1 5 04 1 3 0 4 1 2 378601 106552 106516,36M84P57M| 378491,62M49P25M| 0 0 0.591111 0.84 93 93 87 87 1.33218 0.0196078 1 93 87 166 112 124.2 1 0 0 1
chr10~chr07 453908 883299 FUSIONJUNC_336 6 -- 255,0,0 2 32,36,346,156, 0,18446744073709122258, 1.560710 6 GTAG 0 1 0.500000 23 19 5 60 3 1 2 0 3 3 12 453940 883263 883004,296M| 453908,86M114P145M| 0 0 0.199365 0.244444 296 296 231 231 2.19722 0.00980392 1 296 243 361 170 299.222 6 0 3 3
chr10~chr07 454012 883299 FUSIONJUNC_337 3 -- 255,0,0 2 23,31,314,31, 0,18446744073709122353, 0.636514 6 GTAG 0 1 0.666667 23 0 5 30 3 0 3 0 3 0 8 454035 883268 883109,191M| 454012,23M73P122M|454012,23M239P51M| 0 0 0.275556 0.36 191 191 145 74 1.74787 0.0130719 1 191 145 286 102 194.429 4 0 3 1
chr07~chr10 883300 338619 FUSIONJUNC_338 2 -+ 255,0,0 2 35,34,164,124, 0,18446744073709006970, 0.693147 5 GTAG 0 2 1.000000 17 16 17 20 2 1 1 0 2 0 12 883335 338653 883300,75M37P108M| 338619,225M| 0 0 0.296111 0.478333 183 183 225 225 1.90615 0.0196078 1 220 225 395 140 261.25 6 0 2 4
chr12~chr12 468872 455123 FUSIONJUNC_522 3 +- 255,0,0 2 19,46,13800,109, 0,13769, 0.636514 6 GTAG 0 1 0.666667 19 0 13 30 3 1 2 0 3 0 46 468853 455077 454830,294M| 455074,57M13689P53M| 0 0 0.319365 0.737778 294 294 110 110 3.04482 0.0130719 1 294 145 370 122 208.962 0 23 0 23
chr12~chr12 468921 456520 FUSIONJUNC_524 7 +- 255,0,0 2 28,43,12388,143, 0,12430, 1.277034 6 GTAG 0 1 0.285714 10 23 5 70 7 5 2 0 7 0 44 468893 456477 456233,288M| 456535,51M12234P102M| 0 0 0.298291 0.584215 288 288 153 153 3.01463 0.00560224 1 288 154 372 118 208.69 0 22 0 22
chr12~chrmt 459784 59580 FUSIONJUNC_576 9 ++ 255,0,0 2 41,41,103,144, 0,18446744073709151454, 2.043192 5 GTAG 0 1 0.333333 25 25 1 90 9 5 4 0 9 0 50 459743 59621 459683,102M| 59580,305M| 0 0 0.655686 0.252857 102 102 305 305 2.84117 0.00653595 1 102 305 396 108 204.941 0 25 25 0
chr12~chrmt 489457 59580 FUSIONJUNC_578 2 ++ 255,0,0 2 27,24,3726,41, 0,18446744073709121764, 0.000000 5 GTAG 1 1 1.000000 0 24 3 20 2 2 0 0 2 0 100 489430 59604 485704,109M3543P102M| 59580,305M| 0 0 0.288889 0.195556 211 211 305 305 3.23797 0.0196078 1 213 305 396 129 267.135 0 50 50 0
c
ReplyDelete
Replies

Add comment

Comment moderation has been enabled. All comments must be approved by the blog author. Please type your comment below and hit 'Publish'.

Bioinformatics blog

Tuesday, February 28, 2012

How to estimate insert size for paired-end RNA-Seq data

28 comments:

Blog Archive

Search This Blog