I've been working on a solution to the evil problem of $& making all your regexes do more work. So I spent the last two days coming up with the following pragma (I wrote a pragma!) that offers control of this dastardly variable. Hopefully, it will make it into Perl 5.8. You can't just download this module, though -- I had to alter a couple of files in the source to jibe with what this pragma does.
re::ampersand - Perl pragma to alter $& support in regular expressions
"Perl" =~ /../ and print "<$&>"; # <Pe>
"Perl" =~ /er/ and print "<$&>"; # <er>
{
# disable $& support
no re::ampersand;
"Perl" =~ /../ and print "<$&>"; # <>
"Perl" =~ /er/ and print "<$&>"; # <>
}
{
# disable $& support for simple regexes
no re::ampersand 'simple';
"Perl" =~ /../ and print "<$&>"; # <Pe>
"Perl" =~ /er/ and print "<$&>"; # <>
}
{
# disable $& support for complex regexes
no re::ampersand 'complex';
"Perl" =~ /../ and print "<$&>"; # <>
"Perl" =~ /er/ and print "<$&>"; # <er>
}
When Perl sees you using $`, $&, or $', it has to prepare these
variables after every successful pattern match. This can slow a program down
because these variables are "prepared" by copying the string you matched
against to an internal location. This copying is also how $DIGIT
variables are made accessible, but that only occurs on a per-regex basis:
if a regex has capturing parentheses, the string will be copied, otherwise
it will not be.
Some regexes are simple enough to be matched via the Boyer-Moore substring
matching algorithm. This is a fast way of finding a substring in a
string. Regexes that only rely on constant text and anchors can be matched
via the Boyer-Moore algorithm. (These regexes cannot have capturing
parentheses.) Because of this, they don't get solved through the standard
regex engine, and end up not preparing $& and its friends -- there is
no copying of the string that was matched.
However, if Perl has seen you using $&, it decides that the simple regex
has to go through the engine so it can prepare $&. This means that there
is a two-fold slow-down: first, the simple regex has to go through both the
Boyer-Moore algorithm and the rest of the regex engine, and second, it has to
copy the string that was being matched against.
The re::ampersand pragma allows you to ignore the fact that $& (or its
friends) has been used in your program. This produces a speed-up in portions
of your code that do not need support for $&. This pragma is
lexically scoped, which means it works in the block you call it in.
This module does not turn off capturing support -- if a regex has capturing
parentheses in it, you will inadvertently get support for $&, because it is
based on the copied string that $1, $2, ... are based on.
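As an aside, code that wants the matched text without mentioning $& at all can rebuild it from the offset arrays @- and @+ (available since Perl 5.6), as long as the original string is still around unmodified. A minimal sketch; the helper name match_parts is made up for illustration and is not part of the pragma:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) for the
# first match of $re in $str, or an empty list if there is no match.
# It reads only the match offsets, never the three special variables.
sub match_parts {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    my ($from, $to) = ($-[0], $+[0]);      # offsets of the whole match
    return (
        substr($str, 0, $from),            # the prematch
        substr($str, $from, $to - $from),  # the match itself
        substr($str, $to),                 # the postmatch
    );
}

my @parts = match_parts("complex", qr/.p/);
print "<$parts[0]><$parts[1]><$parts[2]>\n";   # <co><mp><lex>
```

This only works while the offsets still describe the string you hold, so grab the pieces right after the match.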
Your program will run the same way it did before if you do not use this
pragma. Default behavior has not been changed.
You can turn off support for $& and friends with no re::ampersand, which
turns off support for all regexes (unless they have capturing parentheses).
If you only want to turn off support for simple regexes, send it the argument
'simple'. If you only want to turn off support for complex regexes, send it
the argument 'complex'.
Turn on support for $& with use re::ampersand, which turns on support for
all regexes. To only supply support to simple regexes, send it the argument
'simple'. To only supply support to complex regexes, send it the argument
'complex'. Again, any regex with capturing parentheses will always have
support for $& because of the mechanism that provides $DIGIT variables.
#!/usr/bin/perl -w
no re::ampersand;
# simple regex is not weighed down by $&
"Perl" =~ /..$/ and print "<$&>\n"; # <>
{
use re::ampersand;
"Perl" =~ /^../ and print "<$&>\n"; # <Pe>
}
"Perl" =~ /..(?=.$)/ and print "<$&>\n"; # <>
#!/usr/bin/perl -w
# regexes set $&
"Perl" =~ /(?<=.)./ and print "<$&>\n"; # <e>
{
no re::ampersand;
# matching on a string you'd rather not have copied!
$huge_string =~ /a+bc+/ and print "<$&>\n"; # <>
}
# regexes set $&
"Perl" =~ /.(?!..)/ and print "<$&>\n"; # <r>
Jeff japhy Pinyan, japhy@pobox.com.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Finally, a $& compromise!
by dws (Chancellor) on Nov 28, 2001 at 07:06 UTC
Again, any regex with capturing parentheses will always have support for $& because of the mechanism that provides $DIGIT variables.
Does this mean that a regexp that captures $1 implies $& which implies the performance hit for maintaining $`, $&, and $' ? I thought that you only took on the performance hit if you explicitly used $`, $&, or $'.
#!/usr/bin/perl
"simple" =~ /im/ and eval q{ print "<$`><$&><$'>\n" };
###
#!/usr/bin/perl
"complex" =~ /.p/ and eval q{ print "<$`><$&><$'>\n" };
#<co><mp><lex>
###
#!/usr/bin/perl
"capture" =~ /(.t)./ and eval q{ print "<$`><$&><$'>:<$1>\n" };
#<ca><ptu><re>:<pt>
Does that make sense? In order to have $1, you have to have the string that is also used for $&. From perlre:
WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
Some of that will be rewritten with the advent of this pragma, though. It's nice to "rewrite the books".
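The (?: ... ) advice in that perlre passage, as a quick sketch (the sample string and variable names are arbitrary, chosen only for illustration):

```perl
use strict;
use warnings;

my $date = "2001-11-28";

# Capturing group: the year lands in $1, so this pattern pays for
# the copy of the target string.
my $year;
if ($date =~ /^(\d{4})-\d\d-\d\d$/) {
    $year = $1;
    print "year: $year\n";                 # year: 2001
}

# Non-capturing group: (?: ... ) groups for the quantifier only,
# sets no numbered variable, and so asks for no copy by itself.
my $is_date = $date =~ /^\d{4}(?:-\d\d){2}$/ ? 1 : 0;
print "looks like a date: $is_date\n";     # looks like a date: 1
```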
Re: Finally, a $& compromise!
by TheDamian (Vicar) on Nov 28, 2001 at 11:41 UTC
Excellent! Well done.
I don't suppose you could pragmatize $` and $' too, whilst you were at it?
;-)
The big problem is PL_sawampersand is globally scoped
Yes, I've been quietly following the thread in P5P.
Yet another example of why global variables almost always turn out to be a Very Bad Idea, no matter how clever the programmer or how convenient they may seem at the time.
Remember kids: Global variables - just say no!
(tye)Re: Finally, a $& compromise!
by tye (Sage) on Nov 28, 2001 at 20:18 UTC
Neat idea. I don't think it is very practical, tho. The way it is currently designed we'll need to update tons of modules to add "no re::ampersand;" to each file and only then will you be able to use $& in your scripts without the regex in modules being slowed down.
Similarly, if I want to write a module that uses $&, I can't do it in a way that protects the scripts that use my module from the performance penalty.
For this to be practical, you need to be able to isolate the penalty of $& to a lexical block. So that I could say:
{
use re::ampersand;
$x =~ /.{10}/;
$y= $&;
}
and the presence of use re::ampersand would make the penalty of "saw ampersand" go away outside of that block.
I agree that the lack of re::ampersand should keep the old behavior of global slow-down. But you need to come up with a better way to accomplish this for this neat idea to be of real practical value.
I guess you could add a new variable, say "expectampersand" and have regex be "slow" in code compiled where "expectampersand" was true or when "sawampersand" was true at run time. And then you would set "sawampersand" whenever you saw one of the three "problem" variables at compile time during a phase when "expectampersand" was false.
So "sawampersand" would mean that you saw one of the variables outside of a block that did use re::ampersand and a new lexically-scoped compiler hint, "expectampersand", would be added. Noncapturing regexes would be slower within lexical scopes that did use re::ampersand and noncapturing regexes everywhere would be slower if you ever used one of the "problem" variables outside of such a lexical scope.
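To make the proposal concrete, here is a toy model in plain Perl of when a match would pay for the copy under this scheme. The function and its argument names are invented for illustration; this is a sketch of the rule as described, not how the regex engine is actually written.

```perl
use strict;
use warnings;

# Toy model of the proposed rule (all names are hypothetical):
#   saw_ampersand    - one of the three "problem" variables was seen
#                      outside any "use re::ampersand" scope
#   expect_ampersand - the regex was compiled inside such a scope
#   has_captures     - the regex has capturing parentheses
sub must_copy {
    my (%arg) = @_;
    return 1 if $arg{has_captures};      # numbered captures need the copy anyway
    return 1 if $arg{expect_ampersand};  # penalty requested lexically
    return 1 if $arg{saw_ampersand};     # old program-wide behavior
    return 0;                            # plain regex, nobody asked: no copy
}

# A non-capturing regex in a module, while some script elsewhere
# uses the variables inside a "use re::ampersand" block only:
print must_copy(saw_ampersand => 0, expect_ampersand => 0, has_captures => 0), "\n";  # 0

# The same regex compiled inside that block:
print must_copy(saw_ampersand => 0, expect_ampersand => 1, has_captures => 0), "\n";  # 1
```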
On top of this, you could add your current feature of no re::ampersand so that you could also have some code that does lots of non-capturing regex work that you don't want to be penalized in the face of someone somewhere using a "problem" variable w/o scoping the impact. But I think the other feature is much more important.
Make sense? Is that along the lines of what was already suggested by Hugo and Jarkko?
It is kind of like "fence in" vs. "fence out" states in regard to cattle... (:
-
tye
(but my friends call me "Tye")
You brought up the same point as what was suggested by Hugo and Jarkko, and I'm pleased to present a modified version of the pragma. It comes with one caveat: please do not do no re::ampersand unless you're aware of the consequences. If you do that, then any $& found in the no re::ampersand block will set the PL_sawampersand flag for your program. This doesn't sound like a good idea to me, so I advise against using it. I'm still trying to find some way to make PL_sawampersand work like a stack somehow... so that the no re::ampersand doesn't leak like that.
Check out New $& Approach, thanks to Hugo, Jarkko, Tye, and others.
Re: Finally, a $& compromise!
by perrin (Chancellor) on Nov 28, 2001 at 12:04 UTC
Finally, a $& compromise!
What the #$%! Who told you it was okay to use that kind of $*@% language here? Do you kiss your camel with that mouth?
Re (tilly) 1: Finally, a $& compromise!
by tilly (Archbishop) on Nov 28, 2001 at 20:18 UTC
Nice, but after some playing around I can guarantee that you have solved the least of my problems.
What I really want is conditional support for $` and $'. That is much worse.
To give a simple example, try the following 3 programs:
#!/usr/bin/perl
# This demonstrates matching through a string
use strict;
use Time::HiRes qw(gettimeofday);
my $start = gettimeofday();
my $str = "_" x $ARGV[0];
1 while $str =~ /./g;
my $elapsed = gettimeofday() - $start;
print "$ARGV[0] characters took $elapsed seconds\n";
#!/usr/bin/perl
# This demonstrates matching through a string, capturing
use strict;
use Time::HiRes qw(gettimeofday);
my $start = gettimeofday();
my $str = "_" x $ARGV[0];
1 while $str =~ /(.)/g;
my $elapsed = gettimeofday() - $start;
print "$ARGV[0] characters took $elapsed seconds\n";
#!/usr/bin/perl
# This demonstrates matching through a string, capturing $`
use strict;
use Time::HiRes qw(gettimeofday);
# Mess life up here
if ("gotcha" =~ /o/) {
my $fooey = $`;
}
my $start = gettimeofday();
my $str = "_" x $ARGV[0];
1 while $str =~ /(.)/g;
my $elapsed = gettimeofday() - $start;
print "$ARGV[0] characters took $elapsed seconds\n";
If you try it you will find that the first two versions run linearly, with only a modest speed difference for the capturing. But the third is a quadratic speed drop. Should you ever, as I do, use REs as a way of tokenizing, this means that the use of a single $` or $', anywhere, turns linear algorithms quadratic. By contrast turning $& on and off is not going to change your program's scalability.
For that reason if I write something for other people's use, I would really like the option of turning off $` and $' lexically. Because I want access to $1 without being hammered with $` and $'.
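A rough sketch of the sort of tokenizing loop in question: it walks the string with \G and /gc and reads each token from $1, so it needs capturing but never the prematch or postmatch variables. The token names and the tiny grammar are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical tokenizer for a tiny expression language.
# Each pattern is anchored with \G at the current position; /c keeps
# that position when a pattern fails so the next one can be tried.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length $src) {
        if    ($src =~ /\G\s+/gc)            { next }                      # skip blanks
        elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
        elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
        elsif ($src =~ m{\G([-+*/()=])}gc)   { push @tokens, "OP:$1" }
        else  { die "unexpected character at offset " . pos($src) . "\n" }
    }
    return @tokens;
}

print join(" ", tokenize("x = 3 + 42")), "\n";   # ID:x OP:= NUM:3 OP:+ NUM:42
```

Every step here is a capturing match on the same long string, which is exactly the case where a stray $` or $' elsewhere in the program hurts most.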
As specifically pointed out in the section Capturing still works, the use of $1 and friends turns support for them back on, despite the pragma.
I did read what japhy wrote before I posted. While it is nice to turn off the production of capturing strings, the main time I do enough matching that I really care about capturing or not is tokenizing, when I am almost definitely going to be using capturing matches. So this module helps with the problem, but not in the only case where I care.
Re: Finally, a $& compromise!
by BrentDax (Hermit) on Nov 28, 2001 at 06:59 UTC