Monday, November 26, 2012

How to parse a BAM file using Perl

Hello readers,
I have been working with the analysis of junction reads using bam files generated by RNA-Seq read reference genome alignment. I had to write my own customized perl scripts to perform my analysis. Since BAM files are binary files and not simple ASCII text files, these can not be read by perl's default IO directly. Usually people use BioPerl modules to access the BAM files but here is an easier way to do it in perl:


#!/usr/bin/perl
use strict;
use warnings;

open BAM,"samtools view <bam_file> |";

        while(<BAM>){
        next if(/^(\@)/);  ## skipping the header lines (if you used -h in the samools command)
        s/\n//;  s/\r//;  ## removing new line
        my @sam = split(/\t+/);  ## splitting SAM line into array

        ## now use @sam as regular array
       }

Note:
You need to have samtools installed on your system.
Samtools will output alignments as SAM format that follows 1-based coordinate system. So parse the alignment data accordingly.


       
               



2 comments:

  1. Thanks so much. This tip was very helpful to me :)

    ReplyDelete
  2. Useful tip, thanks.
    A few things to consider:

    1) You are using \t+ as the field delimitter, but that will result in merging multiple fields together if the field value is empty. Tyically, sam doesn't allow a blank input (preferring *), but \t is a safer split in this case.

    2) Why not use chomp instead of the regexes? If you're starting from a bam, samtools should output the sam with line-endings appropriate to the current system, so chomp would be appropriate.

    ReplyDelete

Comment moderation has been enabled. All comments must be approved by the blog author. Please type your comment below and hit 'Publish'.