Friday, April 5, 2013

How to parse compressed text files using Perl

Large data files, such as FASTQ files from sequencing runs, can reach gigabytes in size and fill up a system's disk quickly. It is better to keep them compressed (*.gz or *.zip format) to save disk space. Compression usually saves about 4/5th of the original (uncompressed) file size.
Some current bioinformatics tools (such as Picard tools) accept compressed input files, but a problem arises when files need to be parsed with custom scripts (such as Perl scripts). One way is to uncompress the file, parse it and re-compress it, which may take a significant amount of computational time if the data files are large (~GB to ~TB in size). Here is a simple way to parse compressed files without first uncompressing them on disk:

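#!/usr/bin/perl
use strict;
use warnings;

## path to the compressed file, taken from the command line
my $file = shift @ARGV or die "Usage: $0 <file_name>\n";

## run gunzip and read its output through a pipe (gunzip must be installed)
open(my $fh, '-|', 'gunzip', '-c', $file) or die "Cannot run gunzip on $file: $!";

while (my $line = <$fh>) {  ## read single line from the file
    ## parse the string ##
}

## closing a pipe returns false if the command exited with an error
close($fh) or warn "gunzip failed on $file (exit status $?)\n";

If you have *.zip files, you can replace the open command above with:

open(my $fh, '-|', 'unzip', '-p', $file) or die "Cannot run unzip on $file: $!";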


The main downside of this method is that the file handle is attached to a pipe rather than to a regular file, so functions such as seek do not work on it. You can only read the data sequentially this way.
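
If the external gunzip command is not available (for example on Windows), or you would rather stay inside Perl, the core module IO::Uncompress::Gunzip can read the file directly. The following is only a minimal sketch: the helper count_matching_lines and the file name example_reads.fastq.gz are made up for illustration, and the MultiStream option is switched on under the assumption that the file may contain several concatenated gzip members.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

## count_matching_lines is a hypothetical helper: it streams a .gz file
## line by line and counts the lines that match a pattern
sub count_matching_lines {
    my ($file, $pattern) = @_;

    ## MultiStream => 1 keeps reading past the first gzip member
    my $z = IO::Uncompress::Gunzip->new($file, MultiStream => 1)
        or die "Cannot open $file: $GunzipError\n";

    my $count = 0;
    while (defined(my $line = $z->getline())) {
        $count++ if $line =~ $pattern;
    }
    $z->close();
    return $count;
}

## build a tiny FASTQ-like file so the sketch is self-contained
my $text = "\@read1\nACGT\n+\nIIII\n\@read2\nTTGA\n+\nIIII\n";
my $gz   = "example_reads.fastq.gz";   ## hypothetical file name
gzip(\$text => $gz) or die "gzip failed: $GzipError\n";

## count the header lines, i.e. the number of reads
print count_matching_lines($gz, qr/^\@read/), "\n";   ## prints 2

unlink $gz;
```

This is still sequential reading. As far as the module documentation goes, the object only supports seeking forward, so it is not a substitute for random access to an uncompressed file.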

17 comments:

  1. I just found your blog Vinay! Good content! I'm subscribing on Feedly.

  3. hmm maybe i'm missing something but if I use your command I get "gunzip is not recognized as an internal or external command etc.."

    i used it as

    $fasta = <STDIN>; #enter name file
    open INPUT, "gunzip -c $fasta |";

    ...ideas?

  4. Hey, are you using a Windows system?
    This works only on Unix/Linux systems, where the gunzip command is available.

  5. for file handles, etc
    you could try Archive::Zip or IO::Uncompress::Unzip?

    (on cpan)

  6. also this:
    https://metacpan.org/pod/Archive::Zip::SimpleZip

  7. I used the gunzip -c command to read a .gz file line by line and searched for a pattern, but some lines are not displayed even though the pattern is there. How can I overcome this issue?

  8. What type of file are you using? Text, binary, hex?

  9. Not sure why your pattern is not being found. You can edit the script to simply print the text of the file on screen and check whether the file is being read.

  10. It is a huge log file. When I unzip the log file and search for the pattern, I am able to find it, but if I search for the pattern in the compressed file, I am not able to find it.

  11. You can print a few lines from the top of the file on the shell and also from the script, and see if the lines are the same. Also, please make sure that the regex you are using is compatible with Perl (there may be a few differences between regex on the shell and in Perl).

  12. After a lot of investigation I found that seek does not work on data passed through a pipe. I set the position with the seek function, but it has no effect.

  13. Hi,
    'seek' won't work here because the data comes into the script one line at a time through a pipe. 'seek' works only if you are reading files via Perl's disk I/O (i.e. reading the file with Perl directly from the disk). Hope this helps.

  14. Then how should I read compressed files? Will IO::Uncompress::Gunzip work?
