Large data files such as FASTQ files from sequencing runs can reach gigabytes in size and can quickly fill up a system's disk. It is therefore better to keep them compressed (*.gz or *.zip format) to save disk space. Compressing usually saves about four-fifths of the original (uncompressed) file size.
Some current bioinformatics tools (such as Picard tools) accept input files in compressed format, but a problem arises when files need to be parsed with custom scripts (such as Perl scripts). One way is to uncompress the files, parse them and re-compress them, which can take a significant amount of computational time if the data files are large (~GB to ~TB in size). Here is a simple way to parse compressed files without uncompressing them on disk:
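#!/usr/bin/perl
use strict;
use warnings;
## open a read pipe from gunzip; replace <file_name> with the path to your file
open(FH, "gunzip -c <file_name> |") or die "Cannot run gunzip: $!";
while(<FH>){ ## read a single line from the file into $_
## parse the string ##
}
close FH;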
If you have *.zip files, you can replace the open command above with:
open(FH, "unzip -p <file_name> |") or die "Cannot run unzip: $!";
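The only downside of this method is that the file handle is attached to a pipe rather than to a regular file, so you cannot jump to an arbitrary position in the data (for example with seek). You can only parse files sequentially this way, from the first line to the last.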
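As an illustration of sequential parsing, here is a minimal sketch that counts the reads and bases in a gzip-compressed FASTQ file. It assumes the common layout of four lines per record (header, sequence, "+" line, quality string). The subroutine name count_fastq_reads and the script name in the usage comment are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the records in a gzip-compressed FASTQ file by streaming it
# through "gunzip -c". Assumes four lines per record (an assumption
# about the input, not something the pipe itself guarantees).
sub count_fastq_reads {
    my ($path) = @_;

    # List-form pipe open: the file name is handed to gunzip as an
    # argument and is never interpreted by a shell.
    open(my $fh, '-|', 'gunzip', '-c', $path)
        or die "Cannot start gunzip for $path: $!";

    my ($reads, $bases) = (0, 0);
    while (my $header = <$fh>) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        die "Truncated record in $path\n" unless defined $qual;
        chomp $seq;
        $reads++;
        $bases += length $seq;
    }

    # close() on a pipe returns false if gunzip exited with an error.
    close($fh) or die "gunzip failed on $path (exit status $?)\n";
    return ($reads, $bases);
}

# Hypothetical usage: perl count_reads.pl sample.fastq.gz
if (@ARGV) {
    my ($reads, $bases) = count_fastq_reads($ARGV[0]);
    print "$reads reads, $bases bases\n";
}
```

Passing the command as a list ('gunzip', '-c', $path) instead of a single string keeps file names with spaces or shell metacharacters from being misread. Checking the result of close also catches a corrupt or truncated archive, which a plain read loop would silently treat as end of file.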