r36atd has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm reasonably new to perl and very new to BioPerl, so my apologies if this seems like a trivial question. I'm using Bio::AlignIO and Bio::SimpleAlign to generate pairwise alignments of sequences of interest to a reference sequence - in this case the human mtDNA reference sequence. I'm able to generate the alignment in fasta format without issue, however I wish to record the differences between the reference and the seuqences of interest in a tab delimited file for use with the webtool Haplogrep (http://haplogrep.uibk.ac.at/index.html). I can imagine doing this as loop whereby I go through each base of the reference sequence and say if the position in the reference does not match the query sequence, record the position and the nucleotide present in the query sequence. However, I can't figure out how to do this from the Bio::AlignIO object generated as part of the alignment. Here is the code I have so far:

#!/usr/bin/perl use strict; use warnings; use Bio::AlignIO; use Bio::SimpleAlign; my $ref = "rCRS.fas"; my $comp = shift @ARGV or die "please provide comparison sequence: $!\ +n"; my $comp_id; if ($comp =~ /(\S*)\.fas*/) { $comp_id = $1; } my $CAT = "cat $ref ".$comp." > ".$comp.".tmp"; try_cmd($CAT); my $ALN = "muscle -in ".$comp.".tmp -out ".$comp.".aln"; try_cmd($ALN); my $str = Bio::AlignIO->new( -file => $comp.".aln", -format => "fasta", ); my $aln = $str->next_aln(); ####Subroutine#### sub try_cmd { my $cmd = shift; my $status; $status = system($cmd); if ( $status ) { print STDERR ( "problem running $cmd\nReturned $status n" ); } else { print STDERR ( "Successfully ran $cmd\n" ); } }

So to give a brief example, if I had generated an alignment such as this one:

>Ref_seq AATTTGGGCTACT >Query_seq AAATTCGGCTACA

I would then like to generate an output such as:

3A 6C 12A

Can anyone help me figure out how to do so? General comments on better/easier ways to do this are also greatly appreciated.

Replies are listed 'Best First'.
Re: BioPerl: Annotate mismatches in an alignment
by johngg (Canon) on Oct 02, 2014 at 22:59 UTC

    I've no idea about the Bio stuff but here's a way to compare the alignments by XOR'ing them. Identical characters XOR'ed together result in NULL (\x00) so look for non-NULLs to spot where they differ. I use a regex match with a zero-width look-ahead so that pos finds that the regex pointer is looking at the mis-aligned character, not the one behind it. Offsets in strings are zero-based in Perl and note that your last mis-alignment is at character 13, not 12.

    $ perl -Mstrict -Mwarnings -E ' my $ref = q{AATTTGGGCTACT}; my $qry = q{AAATTCGGCTACA}; my $xor = $ref ^ $qry; my @offsets; push @offsets, pos $xor while $xor =~ m{(?=[^\x00])}g; say join q{ }, map { ( $_ + 1 ) . substr $qry, $_, 1 } @offsets;' 3A 6C 13A $

    I hope this is helpful

    Cheers,

    JohnGG

      Thanks for the comments, JohnGG. I see how this could work, but am still stuck in how to get the individual sequences out of the AlignIO object.

      Cheers,
      r36atd