Measuring Substrings Problem

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Substring Distance Problem by Roy Johnson (Monsignor) on Apr 08, 2005 at 12:11 UTC
Using vec leads to a rather elegant solution: `sub score { my ($str, $array) = @_; my $vec = ''; for (@$array) { my $idx = index $str, $_; # Set bits at each matched location vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1; } # Count set bits unpack '%32b', $vec; }` [download] Caution:* Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Substring Distance Problem by monkfan (Curate) on Apr 09, 2005 at 04:58 UTC
Hi Roy, Your 'vec' approach is 5 times faster than our original approach! `Rate bare vec bare 24142/s -- -78% vec 110542/s 358% --` [download] How can I accomodate the duplicate case (please see update above) to your snippet? i.e. Given: my @ar4 = ('GG','GG'); Returns: 4 not 2 Update: Thanks to everybody for their great insights and helps. Regards, Edward	[reply] [d/l]
Re^3: Substring Distance Problem by tlm (Prior) on Apr 09, 2005 at 07:42 UTC
If I may tinker with Roy's code, here is the fix you need. I give the modification for both the original and the xor versions of the algorithm: use strict; sub score { my ($str, $array) = @_; my $vec = ''; for (@$array) { my $ofs = 0; while ( ( my $idx = index $str, $_, $ofs ) > -1 ) { # Set bits at each matched location vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1; $ofs = $idx + 1; } } # Count set bits unpack '%32b*', $vec; } sub score_xor { my ($str, $array) = @_; my $vec = "\0" x length($str); for (@$array) { my $ofs = 0; while ( ( my $idx = index $str, $_, $ofs ) > -1 ) { # Matching substrings are padded into position with nulls $vec \|= ("\0" x $idx) . $_; $ofs = $idx + 1; } } # Matching characters become nulls; others non-nulls $vec ^= $str; # Count nulls $vec =~ tr/\0//; } [download] I was curious to see how these two versions compared, and was surprised to learn that the original one is faster by well over a factor of 2: `Rate xor v1 xor 112734/s -- -59% v1 275692/s 145% --` [download] the lowliest monk	[reply] [d/l] [select]
Re^2: Substring Distance Problem by tlm (Prior) on Apr 08, 2005 at 16:53 UTC
Good stuff. I think it (or the xor variant, or both) would be a good addition to the Snippets Section. the lowliest monk	[reply]
Re: Substring Distance Problem by hv (Prior) on Apr 08, 2005 at 11:05 UTC
Here's a fairly simply approach: sub score{ my($str, $array) = @_; # ensure that longer string comes before its prefix my @substring = sort { $b cmp $a } @$array; my $re = join '\|', map "(?=($_))", @substring; my($count, $next) = (0, 0); while ($str =~ /$re/g) { # $-[0] is the position at which we matched # @- describes the matched captures, so $#- is the actual capture +matched my($start, $which) = ($-[0], $#- - 1); my $end = $start + length($substring[$which]); $next = $start if $next < $start; next if $end < $next; $count += $end - $next; $next = $end; } return $count; } [download] This assumes that you may want substrings of varying lengths within the array, and that you may even have one substring in a set that is an exact prefix of another - if you don't need to allow for one or both of those possibilities, the code could be simplified a bit further. It does assume however that the substrings are simple strings to match directly rather than regexps in their own right: it would otherwise need a different approach to discovering the length of each match. The idea is to construct a regexp that will match any of the strings at any position by turning each into a lookahead; and to make each substring a capture so that we know which matched, and can therefore work out the length of the match. The rest of the code remembers what positions have already been catered for to avoid double-counting. Hugo	[reply] [d/l]
Re: Substring Distance Problem by BrowserUk (Patriarch) on Apr 08, 2005 at 11:13 UTC
This produces the required results for the testcases provided reasonably efficiently. Generalising the implementation is left as an exercise. Basically, you replace spaces with nulls, OR the n-1 shorter strings together and then XOR the result with the longest string. You then count the number of nulls in the result. The definition of "shorter" and "longest in that description is fuzzy :). #! perl -slw use strict; my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s2 = 'GATTACG GCGCTCG AACGGCA'; my $masked = $s1 ^ $s2; print scalar $masked =~ tr[\0][0]; my $s3 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s4 = 'GATTACG '; my $s5 = ' TTACGAG CGTGTAA '; tr[ ][\0] for $s3, $s4, $s5; $masked = $s3 ^ ( $s4 \| $s5 ); print scalar $masked =~ tr[\0][0]; my $s6 = ' GCTCGTG '; my $s7 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s8 = ' TACGAGT '; my $s9 = ' GTGGCGC '; tr[ ][\0] for $s6, $s7, $s8, $s9; $masked = $s7 ^ ( $s6 \| $s8 \| $s9 ); print scalar $masked =~ tr[\0][0]; __END__ P:\test>junk2 21 16 17 [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. Rule 1 has a caveat! -- Who broke the cabal?	[reply] [d/l]
Re^2: Substring Distance Problem by Roy Johnson (Monsignor) on Apr 08, 2005 at 12:30 UTC
I was going to include the generalized version of your solution with my vec solution, as they're functionally fairly similar, but I couldn't get it to work. Finally found my mistake, so here's a solution to the exercise: `sub score { my ($str, $array) = @_; my $vec = "\0" x length($str); for (@$array) { my $idx = index $str, $_; # Matching substrings are padded into position with nulls $vec \|= ("\0" x $idx) . $_; } # Matching characters become nulls; others non-nulls $vec ^= $str; # Count nulls $vec =~ tr/\0//; }` [download] Update: An interesting (possibly quite useful) thing for the OP to note is that the vec solution effectively builds the list of dots as its vector (ones in matched positions), and your solution gives one string with the actual matched characters in position. Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^2: Substring Distance Problem by polettix (Vicar) on Apr 08, 2005 at 12:01 UTC
In my understanding, ewijaya should automate the substring positioning as well as counting - what I think you have left as an exercise. This part would probably spoil your efficiency claim, because you're solving the problem from quite a convenient starting point :) This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years. There's always to learn, luckly, provided I'll be able to remember it when I'll need :) Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))") Don't fool yourself.	[reply]
Re^3: Substring Distance Problem by tlm (Prior) on Apr 08, 2005 at 14:39 UTC
This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years. BrowserUk is The XOR Meister. I had the same reaction as yours when I first encountered the "infamous xor trick" (in this case to find the first position at which two strings differ, or equivalently, the length of the longest common prefix): `$a = "foobar"; $b = "foobAr"; ($a ^ $b) =~ /^(\0)/ and print length $1; __END__ 4` [download] Way cool, though it works as written only if the characters are 1 byte long. There's always to learn, luckly, provided I'll be able to remember it when I'll need :)* Yep, that's the rub. the lowliest monk	[reply] [d/l]
Re: Substring Distance Problem by monkey_boy (Priest) on Apr 08, 2005 at 10:58 UTC
On first look, id use a hash to flag up positions that match then just add up the total number of matches... e.g. my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; # Here are the array of the substrings, note that the number of # substrings may differ from array to array # and length of the substring may vary (defined as a parameter) my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 print &score($s1,\@ar) , "\n"; print &score($s1,\@ar2) , "\n"; print &score($s1,\@ar3) , "\n"; sub score { my ($str,$array) = @_; my %position_score; for my $frag (@{$array}) { my $idx = index($s1, $frag) + 1; for my $pos ($idx .. $idx + (length($frag) - 1)) { $position_score{$pos} = 1; }; }; my $total_score = 0; for my $score (values %position_score) { $total_score += $score; }; return $total_score; }; [download] This is not a Signature...	[reply] [d/l]
Re: Substring Distance Problem by monkey_boy (Priest) on Apr 08, 2005 at 11:04 UTC
Just a thought .... What should happen when one sub-string can match multiple times? e.g. GG can match twice my above implimentation will only handle match the first time & score 2 insetead of 4. This is not a Signature...	[reply] [d/l]
Re^2: Substring Distance Problem by monkfan (Curate) on Apr 08, 2005 at 16:43 UTC
Hi monkey_boy, Thanks so much for your answer. However, if there is a repeated substring cases as you mentioned, it should be scored as 4. Update: With slight modification to your code -by including offset- I managed to accomodate the duplicated case: Read more... (684 Bytes)	[reply] [d/l]
Re: Substring Distance Problem by tlm (Prior) on Apr 08, 2005 at 11:53 UTC
I haven't tested the code below very much; you should if you think you many want to use it. #!/usr/bin/perl -wl use strict; use List::Util qw( min max sum ); sub score { my ($str, @array) = @_; my @pos = sort { $a->[ 0 ] <=> $b->[ 0 ] } map [ $_->[ 0 ], $_->[ 0 ] + length $_->[ 1 ] ], grep $_->[ 0 ] > -1, map [ index( $str, $_ ), $_ ], @array; for my $i ( reverse 0 .. $#pos - 1 ) { my $lefts = [ sort { $a <=> $b } map $_->[ 0 ], @pos[ $i, $i + 1 +] ]; my $rights = [ sort { $b <=> $a } map $_->[ 1 ], @pos[ $i, $i + 1 +] ]; if ( $rights->[ 1 ] - $lefts->[ 1 ] > 0 ) { splice @pos, $i, 2, [ map $_->[ 0 ], $lefts, $rights ] } } return sum map $_->[ 1 ] - $_->[ 0 ], @pos; } my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 print score( $s1, @$_ ) for \@ar, \@ar2, \@ar3; __END__ % perl score.pl 21 16 17 [download] the lowliest monk	[reply] [d/l]
Re: Substring Distance Problem by NateTut (Deacon) on Apr 08, 2005 at 15:53 UTC
I wonder if you might be able to make use of Text::LevenshteinXS?	[reply]