Bioinformatics blog: How to parse compressed text files using Perl

Friday, April 5, 2013

How to parse compressed text files using Perl

Large data files, such as FASTQ files from sequencing runs, can reach gigabytes in size and quickly fill up a system's disk. It is better to keep them compressed (*.gz or *.zip format) to save disk space. Compression usually saves about 4/5 of the original (uncompressed) file size.
Some current bioinformatics tools (such as Picard tools) accept compressed input files, but a problem arises when the files need to be parsed with custom scripts (such as Perl scripts). One way is to uncompress the file, parse it, and re-compress it, which can take a significant amount of computational time if the data files are large (~GB to ~TB in size). Here is a simple way to parse compressed files without uncompressing them on disk:

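#!/usr/bin/perl
use strict;
use warnings;

my $file = "<file_name>"; ## path to the compressed file

## run gunzip and read its decompressed output through a pipe
open(my $fh, "-|", "gunzip", "-c", $file) or die "Cannot run gunzip on $file: $!";

while (my $line = <$fh>) { ## read a single line from the file
    ## parse the string ##
}

close($fh) or warn "gunzip reported a problem (exit status $?)\n";

If you have *.zip files, you can replace the open command above with:

open(my $fh, "-|", "unzip", "-p", $file) or die "Cannot run unzip on $file: $!";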
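
As a concrete illustration of the read loop, here is a minimal sketch that counts reads and bases in a gzipped FASTQ file. The helper name fastq_stats is hypothetical, and the sketch assumes strict four-line FASTQ records and a gunzip program on the PATH (so Unix/Linux only).

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical helper: count the reads and bases in a gzipped FASTQ file.
## Assumes strict four-line records and that gunzip is on the PATH.
sub fastq_stats {
    my ($file) = @_;

    ## list-form pipe open: the file name is never passed through a shell
    open(my $fh, "-|", "gunzip", "-c", $file)
        or die "Cannot run gunzip on $file: $!";

    my ($reads, $bases) = (0, 0);
    while (defined(my $header = <$fh>)) {
        my $seq  = <$fh>;   ## sequence line
        my $plus = <$fh>;   ## separator line
        my $qual = <$fh>;   ## quality line
        last unless defined $qual;   ## stop at a truncated final record
        chomp $seq;
        $reads++;
        $bases += length $seq;
    }
    close($fh) or warn "gunzip reported a problem (exit status $?)\n";
    return ($reads, $bases);
}

## usage: perl fastq_stats.pl reads.fastq.gz
if (@ARGV) {
    my ($reads, $bases) = fastq_stats($ARGV[0]);
    print "$reads reads, $bases bases\n";
}
```

The file is only ever read front to back here, so memory use stays small no matter how large the compressed file is.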

The only downside of this method is that the file handle is a pipe rather than a regular file, so random-access operations such as seek will not work on it. You can only parse files sequentially this way.
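
One possible way around this limitation is sketched below, using the core module IO::Uncompress::Gunzip instead of an external gunzip program. The helper names count_gz_lines and open_gz_seekable are hypothetical. The first reads the compressed file line by line from within Perl; the second decompresses to a temporary file so that seek works, which assumes there is enough disk space for the uncompressed data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use File::Temp qw(tempfile);

## Hypothetical helper: read a .gz file line by line without an external
## gunzip program and return the number of lines seen.
sub count_gz_lines {
    my ($file) = @_;
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "Cannot open $file: $GunzipError";
    my $lines = 0;
    while (defined(my $line = $z->getline())) {
        $lines++;   ## parse $line here
    }
    $z->close();
    return $lines;
}

## Hypothetical helper: decompress into a temporary file and return an
## ordinary read handle on it, so seek and tell work (at the cost of disk space).
sub open_gz_seekable {
    my ($file) = @_;
    my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);   ## removed at exit
    close $tmp_fh;
    gunzip($file => $tmp_name)
        or die "Cannot decompress $file: $GunzipError";
    open(my $fh, "<", $tmp_name) or die "Cannot open $tmp_name: $!";
    return $fh;
}

## usage: perl read_gz.pl data.txt.gz
if (@ARGV) {
    my $file = $ARGV[0];
    print count_gz_lines($file), " lines\n";

    my $fh = open_gz_seekable($file);
    my $first = <$fh>;
    seek($fh, 0, 0) or die "seek failed: $!";   ## a pipe cannot do this
    my $again = <$fh>;
    print "seek works\n" if defined $first && $first eq $again;
}
```

For very large files the temporary copy gives up the disk-space saving while the script runs, so the sequential pipe approach remains the lighter option when random access is not needed. Files made of several concatenated gzip members may also need the module's MultiStream option.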

17 comments:

Lee Katz, July 5, 2013 at 11:45 AM
I just found your blog Vinay! Good content! I'm subscribing on Feedly.

Lee Katz, July 5, 2013 at 11:46 AM
I just found your blog Vinay! Good content! I'm subscribing on Feedly.

Vinay Mittal, July 5, 2013 at 12:31 PM
Thanks, Lee.

Anonymous, September 12, 2013 at 10:50 AM
Hmm, maybe I'm missing something, but if I use your command I get "gunzip is not recognized as an internal or external command etc.."

I used it as

$fasta = <STDIN>; #enter name file
open INPUT, "gunzip -c $fasta |";

...ideas?

Vinay Mittal, September 13, 2013 at 1:25 PM
Hey, are you using a Windows system?
It will work only on a Unix/Linux system.

michaelmd101, October 16, 2015 at 1:55 AM
for file handles, etc.
you could try Archive::Zip or IO::Uncompress::Unzip?

(on CPAN)

michaelmd101, October 16, 2015 at 2:04 AM
also this:
https://metacpan.org/pod/Archive::Zip::SimpleZip

Anonymous, September 30, 2016 at 2:54 PM
I used the gunzip -c .gz command to read line by line from a gz file and searched for the pattern, but some lines are not displayed even though the pattern is there. How can I overcome this issue?

Vinay Mittal, September 30, 2016 at 3:00 PM
What type of file are you using? Text, binary, hex?

Unknown, October 3, 2016 at 5:11 PM
text file only

Vinay Mittal, October 3, 2016 at 5:13 PM
Not sure what's causing your pattern not to be found. You can edit the script and simply print the text of the file on screen to check whether the text file is being read.

Unknown, October 3, 2016 at 5:19 PM
It is a huge log file. When I unzip the log file and search for the pattern, I am able to find it, but if I search for the pattern in the compressed file, I am not able to find it.

Vinay Mittal, October 3, 2016 at 5:21 PM
You can print a few lines from the top of the file on the shell and also from the script and see if the lines are the same. Also, please make sure that the regex you are using is compatible with Perl (there might be a few differences between regex on the shell and in Perl).

Anonymous, December 7, 2016 at 10:29 AM
After a lot of investigation I found that seek is not working over data passed through a pipe. I set the position through the seek function, but it is not reflected.

Vinay Mittal, December 7, 2016 at 10:52 AM
Hi,
'seek' won't work here, as the data comes into the script one line at a time through the pipe. 'seek' works only if you are reading files via Perl's disk I/O (i.e. reading the file using Perl directly from the disk). Hope this helps.

Anonymous, December 7, 2016 at 10:58 AM
Then how do I read compressed files? Will IO::Uncompress::Gunzip work?

Vinay Mittal, December 7, 2016 at 11:12 AM
That may work, but I have never used it.
