Friday, April 5, 2013

How to parse compressed text files using Perl

Large data files such as FASTQ files from sequencing runs may reach gigabytes in size, and they can fill up a system's disk easily and quickly. They are therefore best kept compressed (*.gz or *.zip format) to save disk space. Compression typically saves about four-fifths of the original (uncompressed) file size.
Some current bioinformatics tools (such as Picard tools) accept input files in compressed format, but a problem arises when the files need to be parsed with custom scripts (such as Perl scripts). One way is to uncompress the files, parse them, and re-compress them, which may take a significant amount of computational time if the data files are large (~GB to ~TB in size). Here is a simple way to parse compressed files without uncompressing them first:

#!/usr/bin/perl
use strict;
use warnings;

## open a pipe that streams the decompressed output of gunzip
open(my $fh, '-|', 'gunzip', '-c', '<file_name>')
    or die "Cannot open pipe from gunzip: $!";

while (my $line = <$fh>) {  ## read a single line from the file
    ## parse the line ##
}

close($fh);
If you have *.zip files, you can replace the open call above with:

open(my $fh, '-|', 'unzip', '-p', '<file_name>')
    or die "Cannot open pipe from unzip: $!";


The only downside of this method is that you cannot seek around in the file: a pipe can only be read sequentially, from start to end.
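As a concrete illustration, here is a minimal sketch that counts the reads in a gzipped FASTQ file using this approach; the file name reads.fastq.gz is just a placeholder for your own file.

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'reads.fastq.gz';  ## placeholder; substitute your own file

## stream the decompressed data through a pipe, as above
open(my $fh, '-|', 'gunzip', '-c', $file)
    or die "Cannot open pipe from gunzip: $!";

my $reads = 0;
while (my $line = <$fh>) {
    ## every FASTQ record is exactly four lines; the first is the header
    $reads++ if $. % 4 == 1;
}
close($fh);

print "$file contains $reads reads\n";

Since each FASTQ record spans exactly four lines, counting every fourth line gives the number of reads without ever writing an uncompressed copy to disk.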

Monday, April 1, 2013

Picard "no space left on device" error

Sometimes Picard's tools terminate with the error message  "no space left on device" even though there is plenty of space available on the disk.

In addition, Picard will throw Java exception messages such as:
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Write error;
.
.
Caused by: java.io.IOException: Input/output error
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:297)
    at net.sf.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:197)
.
.
etc.


By default, Picard tries to use the system's temporary directory, which usually has only a limited amount of space allocated to it. If you are working with very large data files, the default temporary directory fills up quickly and Picard stops working.
To avoid this error, create a temporary directory somewhere on your system with plenty of free space and direct Picard to use it by adding TMP_DIR=/some_path/your_temporary_directory/ to the command line.
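
For example, a MarkDuplicates run redirected to a larger temporary directory might look like the sketch below; the jar name and the input/output file names are placeholders for your own setup.

java -jar MarkDuplicates.jar \
    INPUT=sample.bam \
    OUTPUT=sample.dedup.bam \
    METRICS_FILE=sample.metrics.txt \
    TMP_DIR=/some_path/your_temporary_directory/

Picard's default temporary directory comes from the JVM's java.io.tmpdir property, so passing -Djava.io.tmpdir=/some_path/your_temporary_directory/ to the java command should have the same effect.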