in reply to Merging Two Strings

I'm sorry, but I was confused by your last example. Why shouldn't the alignment be as follows for $str3 and $str4:
my $str3 = 'ATGATG'; my $str4 = 'ATGATG'; $merged_result34 = 'ATGATG';
How would you want to resolve the following pairs?
ATGGTAC # match on GTA, or on ATG? CCGTAATG ACTGACTCGACT # match on ACTG or GACT? CAGACTGCC
There are probably other "boundary condition" cases that I can't think of... Anyway, I would highly recommend that you do a careful study of Algorithm::Diff -- there's a reasonble chance that you can use it to good effect here. It supports any degree of intricate and detailed tweaking of the basic "diff"-style comparison of two strings; very cool and very powerful.

And I hope some of the monks who work actively in the BioMed area will point you to the specific modules and web sites for this sort of stuff. If not, try a SuperSearch for that (I know I've seen references to such things here) and/or go to cpan.org and just put "Bio" in the search window there.

Replies are listed 'Best First'.
Re^2: Merging Two Strings
by monkfan (Curate) on Oct 27, 2005 at 06:47 UTC
    Hi graff,
    Thanks for the response. I also need to mention that the string input presuppose that they come in order. Following your example above, given this:
    $s1 = 'ATGGTAC'; $s2 = 'CCGTAATG'; $result12 = merged_function($s1,$2); # Should return: "ATGGTACCGTAATG'
    But if it comes like this:
    $s1 = 'CCGTAATG'; $s2 = 'ATGGTAC'; $result12 = merged_function($s1,$2); # Should return: "CCGTAATGGTAC'

    Regards,
    Edward
      Okay, I think I get it now -- though I still don't understand why "result34" in the OP should have come out your way rather than my way (given that the two input strings were completely identical).

      Anyway, it seems like the algorithm you're after is better described as: align two strings s1,s2 such that a maximal final substring of s1 is an exact match to the initial substring of s2, and return the union of the two strings -- i.e. the common part preceded by the initial unmatched part (if any) of s1 and followed by the final unmatched part (if any) of s2.

      Given that description, I think it could be simple: start with the first character of s2 aligned to the last character of s1, and shift the alignment one slot at a time toward the beginning of s1. At each iteration, if all overlapping characters match, store the resulting "union" string. At the point where the end of s1 is aligned with the end of s2, you're done -- the last union you stored was the best match possible.

      (Or maybe I still really don't get it?)