Monday, November 26, 2012

How to parse a BAM file using Perl

Hello readers,
I have been working with the analysis of junction reads using bam files generated by RNA-Seq read reference genome alignment. I had to write my own customized perl scripts to perform my analysis. Since BAM files are binary files and not simple ASCII text files, these can not be read by perl's default IO directly. Usually people use BioPerl modules to access the BAM files but here is an easier way to do it in perl:


#!/usr/bin/perl
use strict;
use warnings;

open BAM,"samtools view <bam_file> |";

        while(<BAM>){
        next if(/^(\@)/);  ## skipping the header lines (if you used -h in the samools command)
        s/\n//;  s/\r//;  ## removing new line
        my @sam = split(/\t+/);  ## splitting SAM line into array

        ## now use @sam as regular array
       }

Note:
You need to have samtools installed on your system.
Samtools will output alignments as SAM format that follows 1-based coordinate system. So parse the alignment data accordingly.