Monday, February 27, 2012

How to extract paired-end reads from SRA files

SRA (NCBI) stores each sequencing run as a single "sra" or "lite.sra" file. If you want to use data from paired-end sequencing, you will usually want the mates in separate files. When I run the SRA toolkit's "fastq-dump" utility on paired-end SRA files, I sometimes get only one file, with all the mate pairs stored together, rather than two or three files.
The solution is to always run fastq-dump with the "--split-3" option. If the experiment is single-end sequencing, only one fastq file will be generated. If it is paired-end sequencing, there may be two or three fastq files.
Two files (with suffixes "_1" and "_2") are the matched mate-pair read files, whereas the third one (without any suffix) contains all the reads that do not have a mate (or for which SRA couldn't resolve a mate).
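
The suffix rules above can be checked with a small shell sketch. It is only an illustration: the function names dump_run and describe_split3_output are made up for this example, SRR030257 is simply the accession that appears in the comments, and the file names assume the naming described above ("_1", "_2", and no suffix) with a plain ".fastq" extension.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the SRA toolkit.

# Run fastq-dump with --split-3 on one .sra file.
# -O (--outdir) sets the directory the fastq files are written to.
dump_run() {
    local sra_file="$1" outdir="${2:-.}"
    fastq-dump --split-3 -O "$outdir" "$sra_file"
}

# Report what kind of output --split-3 left behind for an accession:
#   paired          -> only _1 and _2 files
#   paired+unpaired -> _1, _2 and a third file of reads without mates
#   single          -> one file without suffix (single-end run)
#   none            -> no fastq files found
describe_split3_output() {
    local acc="$1" dir="${2:-.}"
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        if [ -f "$dir/${acc}.fastq" ]; then
            echo "paired+unpaired"
        else
            echo "paired"
        fi
    elif [ -f "$dir/${acc}.fastq" ]; then
        echo "single"
    else
        echo "none"
    fi
}
```

A possible use, assuming the run file is in the current directory: dump_run SRR030257.sra out && describe_split3_output SRR030257 out. Only the "_1" and "_2" files should go to a paired-end aligner.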

Hope my experiences with NCBI SRA data handling help the readership.

29 comments:

  1. That post saved me a lot of trouble. Thank you!

  2. very useful, thank you.

  3. Great!! This post was a time and a frustration saver

  4. Thank you very much. Your post helps me a lot!

  5. --split-files should do much the same; it creates as many files as there are reads per spot.

  6. On the NCBI SRA web site, there is often no information on whether the runs are single-end or paired-end. I guess it is useful to use the option you mention as the default, but does anyone know how to find this info on the SRA web site?

  7. Deepak Purushotham, April 4, 2013 at 12:36 PM

    Awesome Vinay! Was looking for just this option

    Thanks :)

  8. Deepak Purushotham, April 4, 2013 at 12:38 PM

    Thanks Vinay :) Was looking for exactly this

  9. Thank you so much.
    I have had this problem for months, and I didn't know what I was doing wrong :)
    Thank you once more

  11. That is one blog I would like to follow! Thanks!!

  12. Hi, on using the following command:
    $ fastq-dump --split-files -A SRR030257.lite.sra
    I am not able to produce a fastq file. It shows an error. I have set up the environment variable, but it still does not work. However, on using the following command:
    $./illumina-dump --table-path ./SRRO30257.sra --outdir foldername -qseq 1
    I am able to generate a file, which I later convert to fastq via a Perl program!!

  16. Hi Alok,
    Do not use the "-A" argument. Try this command instead:
    /path_to_folder/fastq-dump --split-3 -O . SRR030257.lite.sra

  17. Thanks, this is very helpful!

  18. Thanks a lot. Exactly what I was looking for.

  19. Great tip! It's easy to miss that --split-3 argument.

  20. Thank you very much!

  21. WoooW man, you saved my life.

    I was stuck with fastq-dump generating two incompatible files; when I ran the STAR aligner on them, I was getting this error:

    EXITING because of FATAL ERROR: Read1 and Read2 are not consistent, reached the end of the one before the other one
    SOLUTION: Check you your input files: they may be corrupted


    In the end I used the --split-3 option and it generated 3 files: _1.fastq, _2.fastq and .fastq.

    I used _1.fastq and _2.fastq and I got a 95% alignment score!!

    Again, thank you.

  22. Great to see your post. Some weeks ago I got into the same situation of dealing with fastq-dump for paired-end data, and I discovered the "--split-3" argument. I ran some tests with fastq-dump using different parameter configurations. You can check the tests in this Biostars post: https://www.biostars.org/p/213348/#213457.

    Greetings.

  23. Thank you for this post. It helped me a lot!

