Re: Substring Distance Problem
by Roy Johnson (Monsignor) on Apr 08, 2005 at 12:11 UTC
|
Using vec leads to a rather elegant solution:
sub score {
my ($str, $array) = @_;
my $vec = '';
for (@$array) {
my $idx = index $str, $_;
# Set bits at each matched location
vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1;
}
# Count set bits
unpack '%32b*', $vec;
}
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
|
|
Hi Roy,
Your 'vec' approach is 5 times faster than
our original approach!
Rate bare vec
bare 24142/s -- -78%
vec 110542/s 358% --
How can I accomodate the duplicate case
(please see update above) to your snippet?
i.e. Given: my @ar4 = ('GG','GG'); Returns: 4 not 2
Update:
Thanks to everybody for their great insights and helps.
| [reply] [d/l] |
|
|
use strict;
sub score {
my ($str, $array) = @_;
my $vec = '';
for (@$array) {
my $ofs = 0;
while ( ( my $idx = index $str, $_, $ofs ) > -1 ) {
# Set bits at each matched location
vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1;
$ofs = $idx + 1;
}
}
# Count set bits
unpack '%32b*', $vec;
}
sub score_xor {
my ($str, $array) = @_;
my $vec = "\0" x length($str);
for (@$array) {
my $ofs = 0;
while ( ( my $idx = index $str, $_, $ofs ) > -1 ) {
# Matching substrings are padded into position with nulls
$vec |= ("\0" x $idx) . $_;
$ofs = $idx + 1;
}
}
# Matching characters become nulls; others non-nulls
$vec ^= $str;
# Count nulls
$vec =~ tr/\0//;
}
I was curious to see how these two versions compared, and was surprised to learn that the original one is faster by well over a factor of 2:
Rate xor v1
xor 112734/s -- -59%
v1 275692/s 145% --
| [reply] [d/l] [select] |
|
|
| [reply] |
Re: Substring Distance Problem
by hv (Prior) on Apr 08, 2005 at 11:05 UTC
|
sub score{
my($str, $array) = @_;
# ensure that longer string comes before its prefix
my @substring = sort { $b cmp $a } @$array;
my $re = join '|', map "(?=($_))", @substring;
my($count, $next) = (0, 0);
while ($str =~ /$re/g) {
# $-[0] is the position at which we matched
# @- describes the matched captures, so $#- is the actual capture
+matched
my($start, $which) = ($-[0], $#- - 1);
my $end = $start + length($substring[$which]);
$next = $start if $next < $start;
next if $end < $next;
$count += $end - $next;
$next = $end;
}
return $count;
}
This assumes that you may want substrings of varying lengths within the array, and that you may even have one substring in a set that is an exact prefix of another - if you don't need to allow for one or both of those possibilities, the code could be simplified a bit further.
It does assume however that the substrings are simple strings to match directly rather than regexps in their own right: it would otherwise need a different approach to discovering the length of each match.
The idea is to construct a regexp that will match any of the strings at any position by turning each into a lookahead; and to make each substring a capture so that we know which matched, and can therefore work out the length of the match.
The rest of the code remembers what positions have already been catered for to avoid double-counting.
Hugo | [reply] [d/l] |
Re: Substring Distance Problem
by BrowserUk (Patriarch) on Apr 08, 2005 at 11:13 UTC
|
This produces the required results for the testcases provided reasonably efficiently. Generalising the implementation is left as an exercise.
Basically, you replace spaces with nulls, OR the n-1 shorter strings together and then XOR the result with the longest string. You then count the number of nulls in the result.
The definition of "shorter" and "longest in that description is fuzzy :).
#! perl -slw
use strict;
my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA';
my $s2 = 'GATTACG GCGCTCG AACGGCA';
my $masked = $s1 ^ $s2;
print scalar $masked =~ tr[\0][0];
my $s3 = 'GATTACGAGTGGCGCTCGTGTAACGGCA';
my $s4 = 'GATTACG ';
my $s5 = ' TTACGAG CGTGTAA ';
tr[ ][\0] for $s3, $s4, $s5;
$masked = $s3 ^ ( $s4 | $s5 );
print scalar $masked =~ tr[\0][0];
my $s6 = ' GCTCGTG ';
my $s7 = 'GATTACGAGTGGCGCTCGTGTAACGGCA';
my $s8 = ' TACGAGT ';
my $s9 = ' GTGGCGC ';
tr[ ][\0] for $s6, $s7, $s8, $s9;
$masked = $s7 ^ ( $s6 | $s8 | $s9 );
print scalar $masked =~ tr[\0][0];
__END__
P:\test>junk2
21
16
17
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
| [reply] [d/l] |
|
|
I was going to include the generalized version of your solution with my vec solution, as they're functionally fairly similar, but I couldn't get it to work. Finally found my mistake, so here's a solution to the exercise:
sub score {
my ($str, $array) = @_;
my $vec = "\0" x length($str);
for (@$array) {
my $idx = index $str, $_;
# Matching substrings are padded into position with nulls
$vec |= ("\0" x $idx) . $_;
}
# Matching characters become nulls; others non-nulls
$vec ^= $str;
# Count nulls
$vec =~ tr/\0//;
}
Update:
An interesting (possibly quite useful) thing for the OP to note is that the vec solution effectively builds the list of dots as its vector (ones in matched positions), and your solution gives one string with the actual matched characters in position.
Caution: Contents may have been coded under pressure.
| [reply] [d/l] |
|
|
In my understanding, ewijaya should automate the substring positioning as well as counting - what I think you have left as an exercise. This part would probably spoil your efficiency claim, because you're solving the problem from quite a convenient starting point :)
This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years. There's always to learn, luckly, provided I'll be able to remember it when I'll need :)
Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")
Don't fool yourself.
| [reply] |
|
|
This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years.
BrowserUk is The XOR Meister.
I had the same reaction as yours when I first encountered the "infamous xor trick" (in this case to find the first position at which two strings differ, or equivalently, the length of the longest common prefix):
$a = "foobar";
$b = "foobAr";
($a ^ $b) =~ /^(\0*)/ and print length $1;
__END__
4
Way cool, though it works as written only if the characters are 1 byte long.
There's always to learn, luckly, provided I'll be able to remember it when I'll need :)
Yep, that's the rub.
| [reply] [d/l] |
Re: Substring Distance Problem
by monkey_boy (Priest) on Apr 08, 2005 at 10:58 UTC
|
On first look, id use a hash to flag up positions that match
then just add up the total number of matches...
e.g.
my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA';
# Here are the array of the substrings, note that the number of
# substrings may differ from array to array
# and length of the substring may vary (defined as a parameter)
my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21)
my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16
my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17
print &score($s1,\@ar) , "\n";
print &score($s1,\@ar2) , "\n";
print &score($s1,\@ar3) , "\n";
sub score
{
my ($str,$array) = @_;
my %position_score;
for my $frag (@{$array}) {
my $idx = index($s1, $frag) + 1;
for my $pos ($idx .. $idx + (length($frag) - 1)) {
$position_score{$pos} = 1;
};
};
my $total_score = 0;
for my $score (values %position_score) {
$total_score += $score;
};
return $total_score;
};
This is not a Signature...
| [reply] [d/l] |
Re: Substring Distance Problem
by monkey_boy (Priest) on Apr 08, 2005 at 11:04 UTC
|
| [reply] [d/l] |
|
|
Hi monkey_boy,
Thanks so much for your answer.
However, if there is a repeated substring cases as you mentioned, it should be scored as 4.
Update:
With slight modification to your code -by including offset- I managed to accomodate
the duplicated case:
| [reply] [d/l] |
Re: Substring Distance Problem
by tlm (Prior) on Apr 08, 2005 at 11:53 UTC
|
#!/usr/bin/perl -wl
use strict;
use List::Util qw( min max sum );
sub score {
my ($str, @array) = @_;
my @pos =
sort { $a->[ 0 ] <=> $b->[ 0 ] }
map [ $_->[ 0 ], $_->[ 0 ] + length $_->[ 1 ] ],
grep $_->[ 0 ] > -1,
map [ index( $str, $_ ), $_ ],
@array;
for my $i ( reverse 0 .. $#pos - 1 ) {
my $lefts = [ sort { $a <=> $b } map $_->[ 0 ], @pos[ $i, $i + 1
+] ];
my $rights = [ sort { $b <=> $a } map $_->[ 1 ], @pos[ $i, $i + 1
+] ];
if ( $rights->[ 1 ] - $lefts->[ 1 ] > 0 ) {
splice @pos, $i, 2, [ map $_->[ 0 ], $lefts, $rights ]
}
}
return sum map $_->[ 1 ] - $_->[ 0 ], @pos;
}
my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA';
my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21)
my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16
my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17
print score( $s1, @$_ ) for \@ar, \@ar2, \@ar3;
__END__
% perl score.pl
21
16
17
| [reply] [d/l] |
Re: Substring Distance Problem
by NateTut (Deacon) on Apr 08, 2005 at 15:53 UTC
|
| [reply] |