. This is a discussion on why regexes might not be so hot in a particular arena, and why my module helps out.
I'd like it to be known that, on a string with many X's in it, doing
$str =~ s/X[^X]*$/X/;
is not very fast. It tries to match at each X it finds. You'd probably be better off doing
$str =~ s/(.*X).*/$1/;
Although in that case, it'd be sweet to use
Regexp::Keep and say:
$str =~ s/.*X\K.*//;
or do it the two-regex way (since \K isn't core Perl, it's not as fast):
$str =~ /.*X/g and $str =~ s/\G.*//;
Or you could reverse the string:
($str = reverse $str) =~ s/^[^X]+//;
$str = reverse $str;
Of course, for this task, a regex is probably the wrong tool. You can use string functions:
substr($str, rindex($str, "X")+1) = "";
Here's a benchmark of these methods (leaving out \K, because it's not worth showing):
use Benchmark 'cmpthese';
my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10;
cmpthese(-5, {
# is this the last X?
last => sub { my $x = $str; $x =~ s/X[^X]*$/X/ },
# capture up to the last X and replace it with itself
capt_repl => sub { my $x = $str; $x =~ s/(.*X).*/$1/ },
# match up to the last X, then remove everything after it
rx_rx => sub { my $x = $str; $x =~ /.*X/g and $x =~ s/\G.*// },
# reverse, remove up to first X, reverse
sexeger => sub {
my $x = $str; ($x = reverse $x) =~ s/^[^X]+//; $x = reverse $x;
},
# find the last X, remove everything after it
substr => sub { my $x = $str; substr($x, rindex($x, "X")+1) = "" },
});
__END__
Rate last sexeger capt_repl rx_rx substr
last 6398/s -- -84% -84% -90% -97%
sexeger 39625/s 519% -- -4% -37% -83%
capt_repl 41274/s 545% 4% -- -35% -82%
rx_rx 63380/s 891% 60% 54% -- -73%
substr 230796/s 3507% 482% 459% 264% --
Notice how much better-suited
substr() is for this task.
You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.
cmpthese(-5, {
last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
sexeger => sub {
my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x;
},
});
__END__
Rate last capt_repl rx_rx sexeger
last 5591/s -- -82% -86% -86%
capt_repl 30599/s 447% -- -21% -24%
rx_rx 38948/s 597% 27% -- -3%
sexeger 40049/s 616% 31% 3% --
This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle
/.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle
/.*[AB]/ as quickly).
So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.
/* C code */
printf("%s", strpbrk("breakfast", "aeiou"));
/* prints: eakfast */
In Perl, it'd be like doing:
if ("breakfast" =~ /[aeiou]/) {
$x = substr("breakfast", $-[0]);
}
That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...
That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:
# crindex is char-class-based rindex()
use String::Index 'crindex';
cmpthese(-5, {
last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
sexeger => sub {
my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x;
},
crindex => sub {
my $x = $str;
substr($x, crindex($x, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")+1) = "";
},
});
__END__
Rate last capt_repl rx_rx sexeger crindex
last 5581/s -- -82% -85% -86% -90%
capt_repl 30606/s 448% -- -20% -23% -44%
rx_rx 38194/s 584% 25% -- -4% -30%
sexeger 39594/s 609% 29% 4% -- -28%
crindex 54829/s 882% 79% 44% 38% --
We are restored. It's not as fast as the original
substr() approach because it has to do more work, but it's faster than any other solution.