YuckFoo has asked for the wisdom of the Perl Monks concerning the following question:
Howdy Monks,
I am optimizing (for speed) my anagram script. I want to
optimize this 'leftover' subroutine. It returns a string
with characters that are in the first string that are
not matched in the second string. Both strings are sorted.
If the second string contains characters that can't be
matched in the first string, undef is returned.
acciimmnnnnoootu - acimnotu = cimnnnoo
acciimmnnnnoootu - acimnotuu = undef
Spliting both strings to arrays first seems a bit brutish
and expensive, but I can't find a faster way. Any ideas?
Thanks!
#!/usr/bin/perl
use strict;
my ($full, $part) = @ARGV;
my ($left) = leftover($full, $part);
if (defined($left)) { print "$full - $part = $left\n"; }
else { print "$full - $part = undef\n"; }
#-----------------------------------------------------------
sub leftover {
my ($full, $part) = @_;
my ($ch, $left);
my ($regx) = join('+.*', (split('', $part)));
if ($full =~ m{$regx}) {
my (@fulls) = split('', $full);
my (@parts) = split('', $part);
while ($ch = shift(@fulls)) {
($ch eq $parts[0]) ? shift(@parts) : ($left .= $ch);
}
}
else { $left = undef; };
return $left;
}
Re: Difference Of Two Strings
by merlyn (Sage) on Nov 03, 2001 at 05:16 UTC
|
Untested for speed, but it's functional:
sub leftover {
my ($full, $part) = @_;
my %count;
$count{$_}++ for split //, $full;
for (split //, $part) {
return undef if --$count{$_} < 0;
}
return join "", map { $_ x $count{$_} } keys %count;
}
-- Randal L. Schwartz, Perl hacker | [reply] [d/l] |
|
Winner by a nose! Thanks merlyn.
Actually it's a lot better. When I saw how quickly this
will return undef, I realized my benchmarking sucks.
I have been benching only on successful matches.
On success:
l_merlyn: 16 wallclock secs (15.60 usr + 0.00 sys = 15.60 CPU) @ 6564.10/s (n=102400)
leftover: 16 wallclock secs (15.74 usr + 0.00 sys = 15.74 CPU) @ 6505.72/s (n=102400)
On failure:
l_merlyn: 1 wallclock secs ( 1.11 usr + 0.00 sys = 1.11 CPU) @ 9225.23/s (n=10240)
leftover: 45 wallclock secs (45.12 usr + 0.00 sys = 45.12 CPU) @ 226.95/s (n=10240)
I should have remembered how slow !~ m// can be.
Definitely the optimization I was looking for!
| [reply] |
|
You might also try this one for speed, which avoids building the list unnecessarily, and uses arrays instead of hashes to avoid the hashing algorithm for just a single character:
sub leftover {
my ($full, $part) = @_;
my @count;
$count[ord $1]++ while $full =~ /(\w)/g;
while ($part =~ /(\w)/g) {
return undef if --$count[ord $1] < 0;
}
my $return;
$count[$_] and $return .= chr($_) x $count[$_] for 0..$#count;
$return;
}
-- Randal L. Schwartz, Perl hacker | [reply] [d/l] |
|
Re: Difference Of Two Strings
by blakem (Monsignor) on Nov 03, 2001 at 04:43 UTC
|
I tried to come up with a solution using tr/// since its much faster, but had to bring in the s/// operator instead...
#!/usr/bin/perl -wT
use strict;
no warnings "uninitialized"; # printing undef triggers a stupid warni
+ng
# testing loop
for my $pair (['aabbcc', 'abc'],
['aabbccdd', 'abcdd'],
['abc', 'abc'],
['abc', '123'],
) {
my $left = leftover(@$pair);
printf ("%8s - %-6s => %-5s\n",@$pair,"'$left'");
}
# scrabble "subtraction" subroutine
sub leftover {
my $string = shift;
my $letters = shift;
$string =~ s/$_// || return for split // => $letters;
return $string;
}
=OUTPUT
aabbcc - abc => 'abc'
aabbccdd - abcdd => 'abc'
abc - abc => ''
abc - 123 => ''
You'll have to be careful in your usage of this sub, since your specs state nearly opposite meanings for the return values of '' and undef.
Update: added || return to test for unmatched chars.
-Blake
| [reply] [d/l] [select] |
|
| [reply] |
|
I updated it slightly to check for the extra chars....
-Blake
| [reply] |
Re: Difference Of Two Strings (complete benchmarks)
by Fastolfe (Vicar) on Nov 03, 2001 at 05:33 UTC
|
Another implementation for you. This one seems pretty speedy (update: optimized and benchmarks updated with everyone else's updates):
sub fastolfe {
my $source = shift;
my $chop = shift;
local($_);
my %found;
$found{$_}++ while ($_ = chop($source)) ne '';
while (($_ = chop($chop)) ne '') {
return if --$found{$_} < 0;
}
my $result;
foreach (sort keys %found) {
$result .= $_ while $found{$_}--;
}
$result;
}
Hybridizing the above with merlyn's version, we can squeeze out a bit more speed:
sub fast_merl {
my ($source, $chop) = @_;
local($_);
my %found;
$found{$_}++ while ($_ = chop($source)) ne '';
while (($_ = chop($chop)) ne '') {
return if --$found{$_} < 0;
}
return join "", map { $_ x $found{$_} } keys %found;
}
Benchmarking most of the versions I see thus far (minus a couple of the less interesting ones, because these benchmarks are getting big), using test input from above (2 success, 2 fail) as well as demerphq's tests below:
# demerphq's test set
Rate blakem merlyn demq_scan demerphq fastolfe fast_merl
+fast_c scan_c
blakem 479/s -- -24% -39% -44% -46% -47%
+ -97% -98%
merlyn 629/s 31% -- -20% -27% -29% -30%
+ -96% -97%
demq_scan 790/s 65% 26% -- -8% -11% -12%
+ -95% -97%
demerphq 861/s 80% 37% 9% -- -3% -4%
+ -95% -96%
fastolfe 887/s 85% 41% 12% 3% -- -1%
+ -94% -96%
fast_merl 901/s 88% 43% 14% 5% 2% --
+ -94% -96%
fast_c 15708/s 3179% 2396% 1889% 1723% 1671% 1644%
+ -- -31%
scan_c 22648/s 4627% 3499% 2767% 2529% 2453% 2415%
+ 44% --
# simple success case
Rate blakem merlyn demerphq fastolfe fast_merl demq_scan
+fast_c scan_c
blakem 6244/s -- -14% -26% -32% -35% -43%
+ -90% -93%
merlyn 7221/s 16% -- -14% -21% -24% -34%
+ -89% -92%
demerphq 8429/s 35% 17% -- -8% -12% -23%
+ -87% -91%
fastolfe 9181/s 47% 27% 9% -- -4% -16%
+ -86% -90%
fast_merl 9563/s 53% 32% 13% 4% -- -12%
+ -85% -90%
demq_scan 10908/s 75% 51% 29% 19% 14% --
+ -83% -88%
fast_c 65634/s 951% 809% 679% 615% 586% 502%
+ -- -28%
scan_c 91428/s 1364% 1166% 985% 896% 856% 738%
+ 39% --
# simple failure case
Rate blakem merlyn demerphq demq_scan fastolfe fast_merl
+ scan_c fast_c
blakem 7759/s -- -39% -51% -60% -63% -63%
+ -94% -94%
merlyn 12666/s 63% -- -20% -35% -39% -40%
+ -89% -91%
demerphq 15783/s 103% 25% -- -19% -24% -26%
+ -87% -89%
demq_scan 19581/s 152% 55% 24% -- -5% -8%
+ -84% -86%
fastolfe 20720/s 167% 64% 31% 6% -- -2%
+ -83% -85%
fast_merl 21209/s 173% 67% 34% 8% 2% --
+ -82% -85%
scan_c 119642/s 1442% 845% 658% 511% 477% 464%
+ -- -14%
fast_c 139171/s 1694% 999% 782% 611% 572% 556%
+ 16% --
Source: http://fastolfe.net/transient/2001/11/02/pm.string.difference.bench | [reply] [d/l] [select] |
|
UPDATED to reflect Fastolfes reply.
Heh. While I was setting up my benchmark there everyone else was as well. I came up with a different set of results. First jungleboy, runrig and yuckfoo failed outright on some of my test cases. Second both merlyn and fast_merl failed because the resulting set of letters are out of order, but when I put a sort clause in as was recommended they passed fine. Yours was the fastest. Only yours, mine and blakem succeded outright.
# Updated: Removed unecessary debug code, minor tidy.
use strict;
use warnings;
use Benchmark 'cmpthese';
#-----------------------------------------------------------
our %subs = (
yuckfoo => sub {
my ( $full, $part ) = @_;
my $left = "";
my ($regx) = join ( '+.*', ( split ( '', $part ) ) );
if ( $full =~ m{$regx} )
{
my (@fulls) = split ( '', $full );
my (@parts) = split ( '', $part );
while ( my $ch = shift (@fulls) )
{
( @parts && $ch eq $parts[0] ) ? shift (@parts) :
( $left .= $ch );
}
} else
{
$left = "__undef__";
}
return $left;
},
jungleboy => sub {
my ( $full, $part ) = @_;
my ($ch);
my (@parts) = split ( '', $part );
my ($regx) = join ( '+.*', @parts );
if ( $full =~ m{$regx} )
{
foreach $ch (@parts)
{
$full =~ s/[$ch]{1}//;
}
} else
{
$full = "__undef__";
}
return $full;
},
demerphq => sub {
my ( $from, $to ) = @_;
my %ltrs;
$ltrs{ substr( $from, $_, 1 ) }++ foreach 0 .. length($from) -
+ 1;
--$ltrs{ substr( $to, $_, 1 ) } < 0
&& return "__undef__" foreach 0 .. length($to) - 1;
return join ( "", sort map { $_ x $ltrs{$_} } keys %ltrs );
},
demq_scan => sub {
my ( $from, $to ) = @_;
my $ret = "";
my ( $f, $t ) = ( 0, 0 );
while (1)
{
my ( $fc, $tc ) = ( substr( $from, $f, 1 ), substr( $to, $
+t, 1 ) );
if ( $fc eq $tc )
{
$t++;
$f++;
if ( substr( $to, $t, 1 ) ne $tc )
{
$f++, $ret .= $fc while substr( $from, $f, 1 ) eq
+$fc;
}
last if $t == length($to);
} elsif ( $fc lt $tc )
{
$ret .= $fc;
$f++;
return "__undef__" if $f >= length $from;
} else
{
return "__undef__";
}
}
return $ret . substr( $from, $f );
},
blakem => sub {
my $string = shift;
my $letters = shift;
$string =~ s/$_// || return "__undef__" for split // => $lette
+rs;
return $string;
},
fastolfe => sub {
my $source = shift;
my $chop = shift;
local ($_);
my %found;
$found{$_}++ while ( $_ = chop($source) ) ne '';
while ( ( $_ = chop($chop) ) ne '' )
{
return "__undef__" if --$found{$_} < 0;
}
my $result = ""; #fixed demerphq
foreach ( sort keys %found )
{
$result .= $_ while $found{$_}--;
}
$result;
},
merlyn => sub {
my ( $full, $part ) = @_;
my %count;
$count{$_}++ for split //, $full;
for ( split //, $part )
{
return "__undef__" if --$count{$_} < 0;
}
return join "", map { $_ x $count{$_} } sort keys %count;
},
fast_merl => sub {
my ( $source, $chop ) = @_;
local ($_);
my %found;
$found{$_}++ while ( $_ = chop($source) ) ne '';
while ( ( $_ = chop($chop) ) ne '' )
{
return "__undef__" if --$found{$_} < 0;
}
return join "", map { $_ x $found{$_} } sort keys %found;
}
# Not sure why it fails..
# runrig => sub {
# my $str = shift;
# my $letters = shift;
# my ( %hash1, %hash2 );
# my @arr1 = split //, $str;
# @hash1{@arr1} = @arr1;
# my @arr2 = split //, $letters;
# my @deleted = grep $_, delete @hash1{@arr2};
# @hash2{@arr2} = @arr2;
# delete @hash2{@deleted};
# return "__undef__" if %hash2;
# join '', keys %hash1;
# },
);
sub test
{
my ($sub) = @_;
foreach my $t (
# tests [0] - [1] = [2]
[qw"ab a b"],
[qw"acciimmnnnnoootu acimnotu cimnnnoo"],
[qw"acciimmnnnnoootu acimnotuu __undef__"],
[qw"ab ab", ""],
[qw"aaaaaaabbbbbbbcccccccddddddddde e aaaaaaabbbbbbbcccccccddd
+dddddd"],
[qw"aaaaaaabbbbbbbcccccccddddddddde aaaaaaabbbbbbbcccccccddddd
+dddd e"],
[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa
+aabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxy z"],
[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa
+bbbbbbcccccddddddddghijklmnopqrstuvwxy aabccdefz"],
[qw"abbcccdde bccd abcde"],
[qw"abbcccdde bccdf __undef__"],
[qw"bbcccdde abccdf __undef__"],
#uncomment me to kill the regex versions
#[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxz aaaaa
+bbbbbbcccccddddddddghijklmnopqrstuvwxy __undef__"],
)
{
#warn "$sub @$t\n";
my $r = $subs{$sub}->( $t->[0], $t->[1] );
$r = ( !defined($r) ? "undef" : $r );
#warn "'$r'\n";
die "in $sub expected $t->[2] got $r from $t->[0] - $t->[1]"
if $t->[2] ne $r;
}
}
my %tests = map { $_ => "test($_)" } keys %subs;
cmpthese( -10, \%tests );
__END__
Benchmark: running blakem, demerphq, demerphq_scan, fast_merl, fastolfe, jungleboy, merlyn, yuckfoo,
each for at least 10 CPU seconds...
blakem: 11 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 497.48/s (n=5231)
demerphq: 11 wallclock secs (10.72 usr + 0.00 sys = 10.72 CPU) @ 755.88/s (n=8100)
demq_scan: 11 wallclock secs (10.57 usr + 0.00 sys = 10.57 CPU) @ 731.63/s (n=7737)
fast_merl: 10 wallclock secs (10.61 usr + 0.00 sys = 10.61 CPU) @ 768.34/s (n=8149)
fastolfe: 10 wallclock secs (10.55 usr + 0.00 sys = 10.55 CPU) @ 779.23/s (n=8217)
jungleboy: 10 wallclock secs (10.18 usr + 0.00 sys = 10.18 CPU) @ 20.03/s (n=204)
merlyn: 10 wallclock secs (10.56 usr + 0.00 sys = 10.56 CPU) @ 595.40/s (n=6285)
yuckfoo: 10 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 20.10/s (n=211)
Rate jungleboy yuckfoo blakem merlyn demerphq_scan demerphq fast_merl fastolfe
jungleboy 20.0/s -- -0% -96% -97% -97% -97% -97% -97%
yuckfoo 20.1/s 0% -- -96% -97% -97% -97% -97% -97%
blakem 497/s 2383% 2374% -- -16% -32% -34% -35% -36%
merlyn 595/s 2872% 2861% 20% -- -19% -21% -23% -24%
demq_scan 732/s 3552% 3539% 47% 23% -- -3% -5% -6%
demerphq 756/s 3673% 3660% 52% 27% 3% -- -2% -3%
fast_merl 768/s 3736% 3722% 54% 29% 5% 2% -- -1%
fastolfe 779/s 3790% 3776% 57% 31% 7% 3% 1% --
Yves / DeMerphq
--
Have you registered your Name Space? | [reply] [d/l] |
|
| [reply] |
|
Thanks for the chop modification. Nice touch. It seems to be at least 10-20% faster than split,
I'll take it! I also appreciate the extensive bench compliation. Very interesting, and very informative.
| [reply] |
Re: Difference Of Two Strings
by runrig (Abbot) on Nov 03, 2001 at 04:51 UTC
|
my $str="abcd";
my $letters = "ab";
print not_in($str, $letters),"\n";
sub not_in {
my $str = shift;
my $letters = shift;
my (%hash1, %hash2);
my @arr1 = split //, $str;
@hash1{@arr1} = @arr1;
my @arr2 = split //, $letters;
my @deleted = grep $_, delete @hash1{@arr2};
@hash2{@arr2} = undef;
delete @hash2{@deleted};
return if %hash2;
join '', keys %hash1;
}
Updated to more accurately reflect the requirements :-)
Actually, this one doesn't remove characters one for one
as it seems you want to. If a character appears in the second arg, it removes all of that character in the first
arg. That's not what you wanted is it?? | [reply] [d/l] |
|
| [reply] |
Re: Difference Of Two Strings
by JungleBoy (Scribe) on Nov 03, 2001 at 05:06 UTC
|
I wanted to remove the split() entirely, but my knowledge of Perl isn't strong enought yet to come up with that solution. Here's a slight improvement that I came up with.
#!/usr/bin/perl
use strict;
my ($full, $part) = @ARGV;
my ($left) = leftover($full, $part);
if (defined($left)) { print "$full - $part = $left\n"; }
else { print "$full - $part = undef\n"; }
#-----------------------------------------------------------
sub leftover {
my ($full, $part) = @_;
my ($ch);
my (@parts) = split('', $part);
my ($regx) = join('+.*', @parts);
if ($full =~ m{$regx}) {
foreach $ch (@parts)
{
$full =~ s/[$ch]{1}//;
}
}
else { $full = undef; };
return $full;
}
Here's pseudo code for what my original idea was (maybe someone else can make it work):
sub leftover{
my ($full, $part) = @_;
while ($full && $part)
{
#RexExp to substitute nothing for the first match
#then remove the matched character from $part
}
}
| [reply] [d/l] [select] |
|
| [reply] |
Re: Difference Of Two Strings (in C)
by Fastolfe (Vicar) on Nov 04, 2001 at 00:38 UTC
|
And lastly, since it's speed we're after, a C implementation can't hurt. This benchmarks orders of magnitude faster than anything seen yet. (Update: fixed memory leak)
use Inline C => <<'__EOC__';
SV *fast_c (char *original, char *chopped) {
int counts[256] = {0}; /* each potential character */
int ptr = 0;
int buffer_size = 0;
char *ret = NULL;
int ret_ptr = 0;
int error = 0;
SV *retsv = &PL_sv_undef;
while (original[ptr] != '\0') {
counts[original[ptr++]]++;
buffer_size++;
}
ptr = 0;
while (!error && chopped[ptr] != '\0') {
counts[chopped[ptr]]--;
buffer_size--;
if (counts[chopped[ptr++]] < 0) {
error++;
}
}
if (!error) {
ret = malloc(buffer_size + 1);
for (ptr = 0; ptr <= 255; ptr++) {
while (counts[ptr]-- > 0) {
ret[ret_ptr++] = ptr;
}
}
ret[ret_ptr] = '\0';
retsv = newSVpvn(ret, strlen(ret));
free(ret);
}
return(retsv);
}
__EOC__
And then here's a C implementation of demerphq's "scanning" method, which doesn't rely on counting up letters. This one's even faster:
use Inline C => <<'__EOC__';
SV *scan_c (char *from, char *to) {
int f = 0;
int t = 0;
int from_len = strlen(from);
int to_len = strlen(to);
int ret_ptr = 0;
unsigned char fc, tc;
int error = 0;
SV *retsv;
char *ret;
if (!from_len || !to_len) return(&PL_sv_undef);
ret = malloc(from_len > to_len ? from_len+1 : to_len+1);
while(!error) {
fc = from[f];
tc = to[t];
if (fc == tc) {
f++; t++;
if (to[t] && (to[t] != tc)) {
while (from[f] == fc) {
f++;
ret[ret_ptr++] = fc;
}
}
if (t == to_len) error = 1;
} else if (!fc || (fc < tc)) {
ret[ret_ptr++] = fc;
f++;
if (f >= from_len) error = 2;
} else {
error = 2;
}
}
if (error < 2) {
while(f <= from_len) {
ret[ret_ptr++] = from[f++];
}
retsv = newSVpvn(ret, strlen(ret));
} else {
retsv = &PL_sv_undef;
}
free(ret);
return retsv;
}
__EOC__
| [reply] [d/l] [select] |
|
Thats really cool Fastolfe its neat to see my code converted to C. Im really going to have to bone up my skills in that area.
Thanks alot, it looks like you just provided me an excuse to learn something new...
Yves / DeMerphq
--
Have you registered your Name Space?
| [reply] |
|
| [reply] |
Re: Difference Of Two Strings
by Dr. Mu (Hermit) on Nov 05, 2001 at 09:29 UTC
|
And now for something a little different...
This method certainly won't win any speed prizes. And I'm not even sure why I did it -- except just to see it work. It relies on the fact that any composite number can be factored but one way into its constituent primes. We begin by assigning each letter of the alphabet a unique prime number. Each unordered string of letters -- a set, really -- can be represented as the product of the primes those letters represent. The difference between two strings, then, will be the quotient of the first product divided by the second, factored into its constituent primes and converted back to letters. This is assuming that the numbers evenly divide, i.e. that the remainder of the division is zero. If not, the string difference is undefined, because it means the second string has letters in it not contained in the first.
The following program illustrates the technique:
use Math::BigInt;
@Primes = (2,3,5,7,11,13,17,19,23,29,31,37,41,
43,47,53,59,61,67,71,73,79,83,89,97,101);
@AsciiPrimes = ((0) x 65, @Primes, (0) x 6, @Primes, (0) x 133);
print StringDif('abcdgoldfish', 'flash'),"\n";
print StringDif('lmnogoldfish', 'dish'),"\n";
print StringDif('osar', 'oar'),"\n";
sub StringDif {
my ($StrA, $StrB) = @_;
my $A = Math::BigInt->new(1);
my $B = Math::BigInt->new(1);
my $Quot, $Rem, $NewQuot, $ReturnStr;
foreach (unpack 'C*', $StrA) {$A *= $AsciiPrimes[$_]}
foreach (unpack 'C*', $StrB) {$B *= $AsciiPrimes[$_]}
if ($A eq '+0' or $B eq '+0') {
warn "Non-letter in string"; return undef
}
($Quot, $Rem) = Math::BigInt->new($A)->bdiv($B);
return undef unless $Rem eq '+0';
$ReturnStr = '';
foreach('a'..'z') {
do {
($NewQuot, $Rem) = Math::BigInt->new($Quot)
->bdiv($AsciiPrimes[ord $_]);
if ($Rem eq '+0') {
$ReturnStr .= $_;
$Quot = $NewQuot
}
} while $Rem eq '+0'
}
return $ReturnStr
}
Note that, due to the use of Math::BigInt, strings can be as long as you like. Also, even though the input strings don't have to be sorted, the output string will be. | [reply] [d/l] |
|
|