Re: Merging Two Strings
by BrowserUk (Patriarch) on Oct 27, 2005 at 07:09 UTC
|
Like others, I think you are going to have to clarify the rules, or pass a third parameter that indicates the initial offset. This comes close to matching your examples, but it falls down on your second. If the first character of the second string matches the last of the first (char/string respectively), then the overlap of 1 will always win.
P:\test\Vector maps>p1
perl> sub merge{
my( $s1, $s2 ) = @_;
my $i=1;
$i++ until substr( $s1, -$i ) eq substr( $s2, 0, $i );
return $s1 . substr( $s2, $i );
};;
perl> print merge( 'ATTTA', 'TTTAA' );;
ATTTAA
perl> print merge( 'ATGTA', 'ATGTA' );;
ATGTATGTA
perl> print merge( 'ATGATG', 'ATGATG' );;
ATGATGATG
perl> print merge( 'ATGGTAC', 'CCGTAATG' );;
ATGGTACCGTAATG
Your addendum about order still doesn't unambiguously resolve the requirement.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
|
| [reply] |
Re: Merging Two Strings
by graff (Chancellor) on Oct 27, 2005 at 06:36 UTC
|
I'm sorry, but I was confused by your last example. Why shouldn't the alignment be as follows for $str3 and $str4:
my $str3 = 'ATGATG';
my $str4 = 'ATGATG';
$merged_result34 = 'ATGATG';
How would you want to resolve the following pairs?
ATGGTAC # match on GTA, or on ATG?
CCGTAATG
ACTGACTCGACT # match on ACTG or GACT?
CAGACTGCC
There are probably other "boundary condition" cases that I can't think of... Anyway, I would highly recommend that you do a careful study of Algorithm::Diff -- there's a reasonble chance that you can use it to good effect here. It supports any degree of intricate and detailed tweaking of the basic "diff"-style comparison of two strings; very cool and very powerful.
And I hope some of the monks who work actively in the BioMed area will point you to the specific modules and web sites for this sort of stuff. If not, try a SuperSearch for that (I know I've seen references to such things here) and/or go to cpan.org and just put "Bio" in the search window there. | [reply] [d/l] [select] |
|
|
Hi graff,
Thanks for the response. I also need to mention that the string input presuppose that they come in order. Following your example above, given this:
$s1 = 'ATGGTAC';
$s2 = 'CCGTAATG';
$result12 = merged_function($s1,$2);
# Should return: "ATGGTACCGTAATG'
But if it comes like this:
$s1 = 'CCGTAATG';
$s2 = 'ATGGTAC';
$result12 = merged_function($s1,$2);
# Should return: "CCGTAATGGTAC'
| [reply] [d/l] [select] |
|
|
Okay, I think I get it now -- though I still don't understand why "result34" in the OP should have come out your way rather than my way (given that the two input strings were completely identical).
Anyway, it seems like the algorithm you're after is better described as: align two strings s1,s2 such that a maximal final substring of s1 is an exact match to the initial substring of s2, and return the union of the two strings -- i.e. the common part preceded by the initial unmatched part (if any) of s1 and followed by the final unmatched part (if any) of s2.
Given that description, I think it could be simple: start with the first character of s2 aligned to the last character of s1, and shift the alignment one slot at a time toward the beginning of s1. At each iteration, if all overlapping characters match, store the resulting "union" string. At the point where the end of s1 is aligned with the end of s2, you're done -- the last union you stored was the best match possible.
(Or maybe I still really don't get it?)
| [reply] |
Re: Merging Two Strings
by GrandFather (Saint) on Oct 27, 2005 at 07:24 UTC
|
Do you mean that you wish to concatenate two strings, but to minimise the length of the final string by maximising the overlap between the end of the first string and the start of the second string?
In practise how long might the strings be and big may the overlap be? Is efficiency in terms of execution time a factor?
Perl is Huffman encoded by design.
| [reply] |
|
|
Dear GrandFather,
Do you mean that you wish to concatenate two strings, but to minimise the length of the final string by maximising the overlap between the end of the first string and the start of the second string?
Yes GrandFather, that's very precise.
In practise how long might the strings be and big may the overlap be?
They are very short. Not longer than 20 alphabets.
Is efficiency in terms of execution time a factor?
No.
| [reply] |
|
|
Face it, monkfan: one way or another, your examples as originally posted do not lead to a consistent description of the task -- they are not compatible with one another.
Please look them over carefully, figure out which example needs to be fixed, and update your post to clearly mark (not delete) the original example that was wrong; also provide a corrected version of that example, and mark it clearly as an update.
Otherwise, some of us will be inclined to downvote you.
| [reply] |
|
|
| [reply] |
|
|
Re: Merging Two Strings
by Skeeve (Parson) on Oct 27, 2005 at 07:48 UTC
|
The regular expression s/(.+)-\1/$1/ will solve it almost. It won't satisfy (yet) your "last-match" rule
#!/usr/bin/perl
use strict;
use warnings;
my %exampledata= (
ATTTA => 'TTTAA',
ATGTA => 'ATGTA',
ATGATG => 'ATGATG',
);
while (my($s1, $s2)= each %exampledata) {
print $s1,'+',$s2,' => ',merge_data($s1, $s2),"\n";
}
(first attempt in readmore)
Update: This will take care of the rule (in readmore)
Update: This will take care of example 2 by inserting an arbitrary rule.
Rules are now:
Find all matches and take
a) the second best match that's longer than 1 character
b) the best match
sub merge_data {
local($_)= join("-", @_);
my($m_str)= $_;
if (s/(.+)-\1/$1/) {
my $first_result= $_;
my $chop_off= length($_[0])-length($1)+1;
$_= substr($m_str, $chop_off);
if (s/(.{2,})-\1/$1/) {
return substr($m_str, 0, $chop_off) . $_;
}
return $first_result;
}
return $_;
}
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
| [reply] [d/l] [select] |
Re: Merging Two Strings
by GrandFather (Saint) on Oct 27, 2005 at 09:59 UTC
|
use warnings;
use strict;
while (<DATA>)
{
chomp;
my $first = $_;
my $second = <DATA>;
last if ! defined $second;
chomp $second;
(print "Degenerate case: $first, $second\n"), next
if length ($first) == 1 or length ($second) == 1;
my $overlap = 0;
while ($overlap < length $first)
{
my $tail = substr $first, -$overlap - 1;
my $match = $tail ^ substr $second, 0, $overlap + 1;
last if $overlap > 1 && $match =~ /^\0+$/;
++$overlap;
}
my $result = $first . substr $second, $overlap + 1;
print "Merge $first and $second as $result\n";
}
__DATA__
A
ATG
ATTTA
TTTAA
ATGTA
ATGTA
ATGATG
ATGATG
Prints:
Merge ATTTA and TTTAA as ATTTAA
Merge ATGTA and ATGTA as ATGTA
Merge ATGATG and ATGATG as ATGATGATG
Update for sauoq's test case
Perl is Huffman encoded by design.
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] |
|
|
| [reply] |
Re: Merging Two Strings
by Zaxo (Archbishop) on Oct 27, 2005 at 06:54 UTC
|
Under the "last merge" rule of example 3, shouldn't example 2 give 'ATGTATGTA' and example 1 give 'TTTAATTTA'? The rule seems murky.
Is the sequence required to start with $str1? I see you just affirmed this.
I'd approach this by comparing substr's for equality in a loop over offsets, lasting out of the loop at first match.
| [reply] |
Re: Merging Two Strings
by sauoq (Abbot) on Oct 27, 2005 at 10:33 UTC
|
If $s1 is equal with $s2, the merged region is $s1.
Your 2nd example follows this spec but your 3rd example doesn't. Either your spec and your 2nd example are wrong or your 3rd example is. I think that's the real source of the ambiguity everyone has been pointing out to you. If I had to guess, I'd say your 3rd example is wrong.
Assuming that's the case, I think the following will work:
sub merge {
my ($s1, $s2) = @_;
my $pos = length($s1) <= length($s2) ? 0 : length($s1) - length($s
+2);
$pos++ while substr($s1, $pos) ne substr($s2, 0, length($s1) - $po
+s);
substr($s1,$pos) = $s2;
return $s1;
}
It assumes that, in the event there is no overlap, the merged string is the concatenation of the two strings. (E.g.
merge("ACTG","TCAG") eq "ACTGTCAG") It also assumes that the second string will never be contained in the first unless it is a tail substring. (E.g. merge("GATC", "AT") is "GATCAT" and not "GATC" but merge("GATC", "TC) is "GATC".)
Here's how it handles your examples:
$ merge.pl ATTTA TTTAA
ATTTAA
$ merge.pl ATGTA ATGTA
ATGTA
$ merge.pl ATGATG ATGATG
ATGATG
Note in particular how it handles the your third example as it is different than the answer you provided.
-sauoq
"My two cents aren't worth a dime.";
| [reply] [d/l] [select] |
|
|
P:\test>p1
perl> sub merge{
my( $s1, $s2 ) = @_;
my $i= length $s1;
$i-- until substr( $s1, -$i ) eq substr( $s2, 0, $i );
return $s1 . substr( $s2, $i );
};;
perl> print merge( 'ATTTA', 'TTTAA' );;
ATTTAA
perl> print merge( 'ATGTA', 'ATGTA' );;
ATGTA
perl> print merge( 'ATGATG', 'ATGATG' );;
ATGATG
perl> print merge( 'ATGGTAC', 'CCGTAATG' );;
ATGGTACCGTAATG
But that still doesn't match the OPs solution to the 3rd example.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: Merging Two Strings
by Roy Johnson (Monsignor) on Oct 27, 2005 at 12:52 UTC
|
my $str3 = 'ATGATG';
my $str4 = 'ATGATG';
my $catted = "$str3 $str4";
$catted =~ s/(.*)(.+) (?=\2)/$1/;
print "$catted\n";
Make the + a {2,} to specify 2 or more character minimum overlap.
Caution: Contents may have been coded under pressure.
| [reply] [d/l] [select] |
|
|
For one, if there is no overlap, that'll return a string with a space in it. For another, he did say he wanted the maximal overlap with the beginning of string 2. Your's gives the minimal.
-sauoq
"My two cents aren't worth a dime.";
| [reply] |
|
|
Those are spec issues, not defects in my solution. There's no spec saying what happens if there's no overlap, so how is it wrong to return a string with a space in it? It's better than starting a game of nethack or deleting all the files on the user's disk.
The maximal overlap rule conflicts with the last-overlap rule. Doing maximal overlap is easier than what I posted (you remove the first capture group and substitute nothing in). And changing the + to a * ensures that the space gets removed, even if the overlap is empty. Does this conform to your understanding of the specs?
$catted =~ s/(.*) (?=\1)//;
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
|
|
|
|
|
|