Merging Two Strings

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Merging Two Strings by BrowserUk (Patriarch) on Oct 27, 2005 at 07:09 UTC
Like others, I think you are going to have to clarify the rules, or pass a third parameter that indicates the initial offset. This comes close to matching your examples, but it falls down on your second. If the first character of the second string matches the last of the first (char/string respectively), then the overlap of 1 will always win. `P:\test\Vector maps>p1 perl> sub merge{ my( $s1, $s2 ) = @_; my $i=1; $i++ until substr( $s1, -$i ) eq substr( $s2, 0, $i ); return $s1 . substr( $s2, $i ); };; perl> print merge( 'ATTTA', 'TTTAA' );; ATTTAA perl> print merge( 'ATGTA', 'ATGTA' );; ATGTATGTA perl> print merge( 'ATGATG', 'ATGATG' );; ATGATGATG perl> print merge( 'ATGGTAC', 'CCGTAATG' );; ATGGTACCGTAATG` [download] Your addendum about order still doesn't unambiguously resolve the requirement. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^2: Merging Two Strings by monarch (Priest) on Oct 27, 2005 at 07:14 UTC
The most important part of programming is not the coding. It is working out what you want to achieve. The want is key to thinking about how.	[reply]
Re: Merging Two Strings by graff (Chancellor) on Oct 27, 2005 at 06:36 UTC
I'm sorry, but I was confused by your last example. Why shouldn't the alignment be as follows for $str3 and $str4: `my $str3 = 'ATGATG'; my $str4 = 'ATGATG'; $merged_result34 = 'ATGATG';` [download] How would you want to resolve the following pairs? `ATGGTAC # match on GTA, or on ATG? CCGTAATG ACTGACTCGACT # match on ACTG or GACT? CAGACTGCC` [download] There are probably other "boundary condition" cases that I can't think of... Anyway, I would highly recommend that you do a careful study of Algorithm::Diff -- there's a reasonble chance that you can use it to good effect here. It supports any degree of intricate and detailed tweaking of the basic "diff"-style comparison of two strings; very cool and very powerful. And I hope some of the monks who work actively in the BioMed area will point you to the specific modules and web sites for this sort of stuff. If not, try a SuperSearch for that (I know I've seen references to such things here) and/or go to cpan.org and just put "Bio" in the search window there.	[reply] [d/l] [select]
Re^2: Merging Two Strings by monkfan (Curate) on Oct 27, 2005 at 06:47 UTC
Hi graff, Thanks for the response. I also need to mention that the string input presuppose that they come in order. Following your example above, given this: `$s1 = 'ATGGTAC'; $s2 = 'CCGTAATG'; $result12 = merged_function($s1,$2); # Should return: "ATGGTACCGTAATG'` [download] But if it comes like this: `$s1 = 'CCGTAATG'; $s2 = 'ATGGTAC'; $result12 = merged_function($s1,$2); # Should return: "CCGTAATGGTAC'` [download] Regards, Edward	[reply] [d/l] [select]
Re^3: Merging Two Strings by graff (Chancellor) on Oct 27, 2005 at 07:21 UTC
Okay, I think I get it now -- though I still don't understand why "result34" in the OP should have come out your way rather than my way (given that the two input strings were completely identical). Anyway, it seems like the algorithm you're after is better described as: align two strings s1,s2 such that a maximal final substring of s1 is an exact match to the initial substring of s2, and return the union of the two strings -- i.e. the common part preceded by the initial unmatched part (if any) of s1 and followed by the final unmatched part (if any) of s2. Given that description, I think it could be simple: start with the first character of s2 aligned to the last character of s1, and shift the alignment one slot at a time toward the beginning of s1. At each iteration, if all overlapping characters match, store the resulting "union" string. At the point where the end of s1 is aligned with the end of s2, you're done -- the last union you stored was the best match possible. (Or maybe I still really don't get it?)	[reply]
Re: Merging Two Strings by GrandFather (Saint) on Oct 27, 2005 at 07:24 UTC
Do you mean that you wish to concatenate two strings, but to minimise the length of the final string by maximising the overlap between the end of the first string and the start of the second string? In practise how long might the strings be and big may the overlap be? Is efficiency in terms of execution time a factor? Perl is Huffman encoded by design.	[reply]
Re^2: Merging Two Strings by monkfan (Curate) on Oct 27, 2005 at 07:35 UTC
Dear GrandFather, Do you mean that you wish to concatenate two strings, but to minimise the length of the final string by maximising the overlap between the end of the first string and the start of the second string? Yes GrandFather, that's very precise. In practise how long might the strings be and big may the overlap be? They are very short. Not longer than 20 alphabets. Is efficiency in terms of execution time a factor? No. Regards, Edward	[reply]
Re^3: Merging Two Strings by graff (Chancellor) on Oct 27, 2005 at 07:55 UTC
Face it, monkfan: one way or another, your examples as originally posted do not lead to a consistent description of the task -- they are not compatible with one another. Please look them over carefully, figure out which example needs to be fixed, and update your post to clearly mark (not delete) the original example that was wrong; also provide a corrected version of that example, and mark it clearly as an update. Otherwise, some of us will be inclined to downvote you.	[reply]
Re^3: Merging Two Strings by GrandFather (Saint) on Oct 27, 2005 at 07:45 UTC
However that doesn't quite work with your third example. Lets try again: You wish to concatenate two strings by overlaping the end of the first string and the start of the second string by the minimum distance that they match? Perl is Huffman encoded by design.	[reply]
Re^4: Merging Two Strings by Skeeve (Parson) on Oct 27, 2005 at 08:03 UTC
Re: Merging Two Strings by Skeeve (Parson) on Oct 27, 2005 at 07:48 UTC
The regular expression `s/(.+)-\1/$1/` will solve it almost. It won't satisfy (yet) your "last-match" rule `#!/usr/bin/perl use strict; use warnings; my %exampledata= ( ATTTA => 'TTTAA', ATGTA => 'ATGTA', ATGATG => 'ATGATG', ); while (my($s1, $s2)= each %exampledata) { print $s1,'+',$s2,' => ',merge_data($s1, $s2),"\n"; }` [download] (first attempt in readmore) Read more... (299 Bytes) Update: This will take care of the rule (in readmore) Read more... (581 Bytes) Update: This will take care of example 2 by inserting an arbitrary rule. Rules are now: Find all matches and take a) the second best match that's longer than 1 character b) the best match `sub merge_data { local($_)= join("-", @_); my($m_str)= $_; if (s/(.+)-\1/$1/) { my $first_result= $_; my $chop_off= length($_[0])-length($1)+1; $_= substr($m_str, $chop_off); if (s/(.{2,})-\1/$1/) { return substr($m_str, 0, $chop_off) . $_; } return $first_result; } return $_; }` [download] `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re: Merging Two Strings by GrandFather (Saint) on Oct 27, 2005 at 09:59 UTC
This seems to do what you want: use warnings; use strict; while (<DATA>) { chomp; my $first = $_; my $second = <DATA>; last if ! defined $second; chomp $second; (print "Degenerate case: $first, $second\n"), next if length ($first) == 1 or length ($second) == 1; my $overlap = 0; while ($overlap < length $first) { my $tail = substr $first, -$overlap - 1; my $match = $tail ^ substr $second, 0, $overlap + 1; last if $overlap > 1 && $match =~ /^\0+$/; ++$overlap; } my $result = $first . substr $second, $overlap + 1; print "Merge $first and $second as $result\n"; } __DATA__ A ATG ATTTA TTTAA ATGTA ATGTA ATGATG ATGATG [download] Prints: `Merge ATTTA and TTTAA as ATTTAA Merge ATGTA and ATGTA as ATGTA Merge ATGATG and ATGATG as ATGATGATG` [download] Update for sauoq's test case Perl is Huffman encoded by design.	[reply] [d/l] [select]
Re^2: Merging Two Strings by sauoq (Abbot) on Oct 27, 2005 at 10:47 UTC
This seems to do what you want Somehow I doubt that he wants this: `Merge A and ATG as AG` -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re^3: Merging Two Strings by GrandFather (Saint) on Oct 27, 2005 at 18:32 UTC
Thanks for the test case. Any other good ones that look like failures? Perl is Huffman encoded by design.	[reply]
Re: Merging Two Strings by Zaxo (Archbishop) on Oct 27, 2005 at 06:54 UTC
Under the "last merge" rule of example 3, shouldn't example 2 give '`ATGTATGTA`' and example 1 give '`TTTAATTTA`'? The rule seems murky. Is the sequence required to start with `$str1`? I see you just affirmed this. I'd approach this by comparing substr's for equality in a loop over offsets, lasting out of the loop at first match. After Compline, Zaxo	[reply]
Re: Merging Two Strings by sauoq (Abbot) on Oct 27, 2005 at 10:33 UTC
If `$s1` is equal with `$s2`, the merged region is `$s1`. Your 2nd example follows this spec but your 3rd example doesn't. Either your spec and your 2nd example are wrong or your 3rd example is. I think that's the real source of the ambiguity everyone has been pointing out to you. If I had to guess, I'd say your 3rd example is wrong. Assuming that's the case, I think the following will work: `sub merge { my ($s1, $s2) = @_; my $pos = length($s1) <= length($s2) ? 0 : length($s1) - length($s +2); $pos++ while substr($s1, $pos) ne substr($s2, 0, length($s1) - $po +s); substr($s1,$pos) = $s2; return $s1; }` [download] It assumes that, in the event there is no overlap, the merged string is the concatenation of the two strings. (E.g. `merge("ACTG","TCAG") eq "ACTGTCAG"`) It also assumes that the second string will never be contained in the first unless it is a tail substring. (E.g. `merge("GATC", "AT")` is `"GATCAT"` and not `"GATC"` but `merge("GATC", "TC)` is `"GATC"`.) Here's how it handles your examples: `$ merge.pl ATTTA TTTAA ATTTAA $ merge.pl ATGTA ATGTA ATGTA $ merge.pl ATGATG ATGATG ATGATG` [download] Note in particular how it handles the your third example as it is different than the answer you provided. -sauoq "My two cents aren't worth a dime.";	[reply] [d/l] [select]
Re^2: Merging Two Strings by BrowserUk (Patriarch) on Oct 27, 2005 at 16:05 UTC
Given the updated OP, a minor mod to my first attempt seems to work? `P:\test>p1 perl> sub merge{ my( $s1, $s2 ) = @_; my $i= length $s1; $i-- until substr( $s1, -$i ) eq substr( $s2, 0, $i ); return $s1 . substr( $s2, $i ); };; perl> print merge( 'ATTTA', 'TTTAA' );; ATTTAA perl> print merge( 'ATGTA', 'ATGTA' );; ATGTA perl> print merge( 'ATGATG', 'ATGATG' );; ATGATG perl> print merge( 'ATGGTAC', 'CCGTAATG' );; ATGGTACCGTAATG` [download] But that still doesn't match the OPs solution to the 3rd example. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re: Merging Two Strings by Roy Johnson (Monsignor) on Oct 27, 2005 at 12:52 UTC
This oughta do it. `my $str3 = 'ATGATG'; my $str4 = 'ATGATG'; my $catted = "$str3 $str4"; $catted =~ s/(.)(.+) (?=\2)/$1/; print "$catted\n";` [download] Make the `+` a `{2,}` to specify 2 or more character minimum overlap. Caution:* Contents may have been coded under pressure.	[reply] [d/l] [select]
Re^2: Merging Two Strings by sauoq (Abbot) on Oct 27, 2005 at 15:59 UTC
For one, if there is no overlap, that'll return a string with a space in it. For another, he did say he wanted the maximal overlap with the beginning of string 2. Your's gives the minimal. -sauoq "My two cents aren't worth a dime.";	[reply]
Re^3: Merging Two Strings by Roy Johnson (Monsignor) on Oct 27, 2005 at 16:24 UTC
Those are spec issues, not defects in my solution. There's no spec saying what happens if there's no overlap, so how is it wrong to return a string with a space in it? It's better than starting a game of nethack or deleting all the files on the user's disk. The maximal overlap rule conflicts with the last-overlap rule. Doing maximal overlap is easier than what I posted (you remove the first capture group and substitute nothing in). And changing the + to a * ensures that the space gets removed, even if the overlap is empty. Does this conform to your understanding of the specs? `$catted =~ s/(.) (?=\1)//;` [download] Caution:* Contents may have been coded under pressure.	[reply] [d/l]
Re^4: Merging Two Strings by GrandFather (Saint) on Oct 27, 2005 at 19:43 UTC
Re^5: Merging Two Strings by Roy Johnson (Monsignor) on Oct 27, 2005 at 19:47 UTC
Re^4: Merging Two Strings by sauoq (Abbot) on Oct 27, 2005 at 21:49 UTC