Simple String Alignment

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Revered monks,
First of all I want to apologize of asking (rather quite many) questions these last few days, if that annoys (some). Even though I've mustered my accumulated knowledge from PM, there still bound to be a problem which I am stuck. So here again, I humbly seek your illumination..

This is the problem I encounter.
I wanted to align the sequence given as array in HoA, based on the given query string.

Update: Original rule removed. See below for better explanation of how I get the answer.

My current (failing) code, which I worried my overall approach to it maybe wrong:

#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my $hash = {
             'S1' => [ 'A','B','C','D','H','A' ],
             'S2' => [ 'A','C','D','B','G','J' ],
             'S3' => [ 'C', 'A', 'D', 'H','M','K' ],
             'S4' => [ 'A', 'B', 'I', 'C','I','D' ]
           };

my $seq1 = 'A C D'; # there is always a space between the query chars

#This two are for testing
my $seq2 = 'A B C';
my $seq3 = 'A B';

my @alignment = align($hash,$seq);
print Dumper \@alignment;

sub align
{
    my ($hashref,$seq) = @_;
    my @seq_ar = split(/\s/,$seq);
    my $test = join('(.*?)',@seq_ar);

    #Attempt to find how many possible gap
    my $count_stars = $test =~ tr/\*//;
    print "Stars: $count_stars\n";

    my @hyph_padded_seq;

    for my $key( sort {$a cmp $b} keys(%$hashref))
    {
        my $string = join('',@{$hashref->{$key}});
         
        # whatever strings that go here are
        # guaranteed to contain $seq

        if($string =~ $test)
        {
            my $gap_sum;

            # I may be doing something silly here
            # Here I'm trying to find the sum of the gap
            # which is supposed to be captured by
            # regex memory variable $1, $2 .. etc using loop
            
            for ( 1..$count_stars  )
            {
              # Creating memory variable $1,$2..etc
              # which doesnt' work
              # and create error

              my $mem = '$'.$_;

              # counting (chars) gap that is stored in memory variable
              my $count_gap = $mem =~ tr/[A-Z]//;
              $gap_sum += $count_gap;

              # Next what I intended to do is
              # to replace 2 corresponding consecutive chars
              # with "-" of size $1 $2 of highest $gap_sum...etc
              # then push every newly created seq with (-) 
              # into array @hyph_padded_seq

              # I really am stuck here.
            }

        }
    }
    return @hyph_padded_seq;
}
[download]

The intended answers are:

Query: "A C D"     Query: "A B C"    Query: "A B"   Query: "C D"
Answers:           Answers:          Answers:       Answers:
[                  [                 [              [
  'AB-C-D',         'ABIC',           'A--B',        'C-D',
  'A--C-D',         'AB_C'            'ACDB',        'C-D',
  'ABICID',        ]                  'A--B'         'CAD',
]                                    ]               'CID'
                                                    ]
[download]

Update: This is how I intended to get the answers

Take Query "A C D" as an example:

             'S1' => [ 'A', 'B', 'C', 'D','H','A' ],
             'S2' => [ 'A', 'C', 'D', 'B','G','J' ],
             'S3' => [ 'C', 'A', 'D', 'H','M','K' ],
             'S4' => [ 'A', 'B', 'I', 'C','I','D' ]  

1. String "A C D" can be found only in S1,S2,S4. 
   Thus, it was array from S1,S2,S4 that is taken for alignment.

2. See, S4->"ABICID" gives the biggest gap compare to S1,S2.
   As shown here:
   S4->"ABICID" gives A[BI]C[I]D, since there are 2 + 1 gaps. 
                                  2 for 'BI', 1 for 'I'
   S1->"ABCDHA" gives A[B]CD, only 1 gaps for [B]
   S2->"ACDBGJ" gives ACD,    without any gaps in between.

3. Then S1,S2 must be align *based* on S4. That means
   the maximum span of S1, S2 is until D ends.
   S1->"ABCD" 
   S2->"ACD"
   These are the two strings we align with S4->"ABICID"

4. Now, lets align S1->"ABCD" with S4->"ABICID"
   In S1, there are 1 gap in ABC in comparison to S4
   like this:
          S1->  AB-C
          S4->  ABIC
   
   There is another 1 gap in CD in comparison to S4
   like this:

          S1-> C-D
          S4-> CID

    Thus the full alignment between S1 and S4 are:

        S1-> AB-C-D
        S4-> ABICID

5. Then align S2->"ACD" with S4->"ABICID"
   In S2, there are 2 gaps in AC in comparison to S4
   like this:
          S2->  A--C
          S4->  ABIC
   
   There is another 1 gap in CD in comparison to S4
   like this:

          S2-> C-D
          S4-> CID

    Thus the full alignment between S1 and S4 are:

        S2-> A--C-D
        S4-> ABICID

6. The final alignment gives:
        S1-> AB-C-D
        S2-> A--C-D
        S4-> ABICID
[download]

Regards,
Edward

Comment on Simple String Alignment Select or Download Code

Replies are listed 'Best First'.
Re: Simple String Alignment by tlm (Prior) on May 10, 2005 at 13:14 UTC
It looks to me that all you need is the standard Needleman-Wunsch algorithm: apply it n times to align your query sequence against all the sequences in `$hash`, read off the sequence whose best alignment has the longest gap, and then use NW again to align this maximal-gap sequence against the remaining n-1 sequences in `$hash`. the lowliest monk	[reply]
Re: Simple String Alignment by hv (Prior) on May 10, 2005 at 10:36 UTC
I'm unsure what you're trying to achieve here, but as I understand it you're taking an input sequence: `"A C D"` [download] turning it into a regexp: `/A(.?)C(.?)D/` [download] matching that against an input sequence: `"ABICID"` [download] and want to replace all the characters that didn't directly match with a hyphen: `"A--C-D"` [download] If so, one possible approach is to take advantage of the special match arrays `@-` and `@+` - for example for the capture variable `$3`, `$-[3]` will be the offset of the start of that capture, and `$+[3]` the offset just after the end of that capture. So you could use something like: `if ($string =~ $test) { for my $offset (1 .. $#-) { my $start = $-[$offset]; my $len = $+[$offset] - $start; substr($test, $start, $len) = "-" x $len; } # now extract only the matched extent push @hyph_padded_seq, substr($test, $-[0], $+[0] - $-[0]); }` [download] So for the given hash, `"A C D"` would give results of `"A-CD"` for key S1, and `"A--C-D"` for key S4. That doesn't match the "intended answers" you show, but that's the closest I can get to guessing what you're looking for. Hugo	[reply] [d/l] [select]
Re: Simple String Alignment by Adrade (Pilgrim) on May 10, 2005 at 21:15 UTC
Dear Edward, This was my approach to your problem. It runs through all the sequences you give it in the @sequences array looking for the one with the biggest gaps, and kicking out any where the letter order isn't right. It populates the @in_the_running array with sequences that will work. While this is happening, it makes sure that the sequence with the biggest gap stays at the front of the array (index 0). It then prints the sequence with the biggest gap, and runs through the rest of the sequences that qualified - each sequence it compares against the one with the biggest gap, letter by letter. If it matches, it prints a letter - if it doesn't it prints a dash. This is one of those fun problems where multiple approaches could be applied - I hope mine assists you with yours! Best, -Adam @in = split(//,<STDIN>); pop(@in); $regex = '.'.join('(.)',@in).'.'; @sequences = qw/ABCDHA ACDBGJ CADHMK ABICID/; for (@sequences) { @letters = split(//,$_); s/.($in[0].$in[$#in])./$1/; if ((@gaps) = (m/$regex/o)){ if (length(join('',@gaps)) > $biglen) { unshift(@in_the_running, $ +_); $biglen = length(join('',@gaps)) } else { push(@in_the_running, $_) } } } print $in_the_running[0], "\n"; @base = split(//,$in_the_running[0]); for my $seqno (1..$#in_the_running) { @seq = split(//,$in_the_running[$seqno]); for $q (@base) { if ($seq[0] eq $q) { print $q; shift(@seq); next } else { print '-' } } print "\n"; } [download]	[reply] [d/l]
Re^2: Simple String Alignment by monkfan (Curate) on May 11, 2005 at 01:59 UTC
Adrade, Thanks so much for your reply. I've included your part in my code. Somehow your approach doesn't consider the maximum span in the given query. For example with query: "ACD" it gives `S4 ABIC S1 AB-C S2 A--C` [download] i.e it left out the "D" for alignment (kindly see answers above). In other case, given query "AB" it over-aligned them, it gives: `S4 ABICID S2 A--C-D S1 AB-C-D` [download] Is there any way I can go about it in your code. Hope to hear from you again. Update: Finally! with some tweak suggested by Adrade below. I can get the problem SOLVED! Many thanks for the rest of you, especially Adrade. Don't know what to say! For those who are interested in the final form of the code, check this out: Read more... (3 kB) Regards, Edward	[reply] [d/l] [select]
Re^3: Simple String Alignment by Adrade (Pilgrim) on May 11, 2005 at 02:51 UTC
Dear Edward, I think I have discovered the problem. In my code, I did a pop(@in) just to remove the newline that occurs when the user types in the letters. This was accidentally left in your code. If you comment out line 41, everything should work fine! I hope this helps - let me know how it turns out! Best -Adam Update: The finally (maybe :-) revised code (that should do everything correctly) can be found here. Another Update! Hehe - ok... lets hope this hits the mark: Code is here YA-Update: The correct and final code is in the above post. Yay. Be well, -Adam	[reply]
Re: Simple String Alignment by hv (Prior) on May 11, 2005 at 11:09 UTC
Now that I understand the question, here's another attempt. :) sub align { my($hash, $seq) = @_; my $best = { totlen => 0 }; my @match; my $re = join '(.*?)', ($seq =~ /(\w)/g); for my $key (sort keys %$hash) { my $string = join '', @{ $hash->{$key} }; next unless $string =~ $re; my $offset = $-[0]; my $length = $+[0] - $offset; my $match = { key => $key, matched => substr($string, $offset, $length), totlen => $length, inserts => [ map $_ - $offset, @+[1 .. $#-] ], lengths => [ map $+[$_] - $-[$_], 1 .. $#+ ], }; push @match, $match; $best = $match if $match->{totlen} > $best->{totlen}; } return [ map { my $match = $_; my $string = $match->{matched}; for (reverse 0 .. $#{ $match->{lengths} }) { my $bestlen = $best->{lengths}[$_]; my $curlen = $match->{lengths}[$_]; if ($bestlen > $curlen) { substr($string, $match->{inserts}[$_], 0) = "-" x ($bestlen - +$curlen); } } $string; } @match ]; } [download] The idea is to apply the regexp to each string only once, and capture all relevant information at that point (also recording the longest match along the way), then take a second pass through all the matched substrings to add the hyphenation. It isn't clear from the example what should happen if one of the strings to be modified already has a gap longer than that required, eg testing "ACD" against "ABICID" and "ACHHDH" - the code above would leave such a gap alone yielding `[ "ABICID", "A--CHHD" ]` [download] , but if it should instead be truncated or marked in some other way you'd need to add an `... elsif ($bestlen < $curlen) ...` chunk near the bottom. Hugo	[reply] [d/l] [select]