Finally, a $& compromise!

by japhy (Canon)
on Nov 28, 2001 at 06:35 UTC ( [id://127988]=perlmeditation )

I've been working on a solution to the evil problem of $& making all your regexes do more work. I spent the last two days coming up with the following pragma (I wrote a pragma!) that offers control of this dastardly variable. Hopefully, it will make it into Perl 5.8. You can't just download this module, though -- I had to alter a couple of files in the source to jibe with what this pragma does.

NAME

re::ampersand - Perl pragma to alter $& support in regular expressions


SYNOPSIS

    "Perl" =~ /../ and print "<$&>";  # <Pe>
    "Perl" =~ /er/ and print "<$&>";  # <er>
    {
      # disable $& support
      no re::ampersand;
      "Perl" =~ /../ and print "<$&>";  # <>
      "Perl" =~ /er/ and print "<$&>";  # <>
    }
    {
      # disable $& support for simple regexes
      no re::ampersand 'simple';
      "Perl" =~ /../ and print "<$&>";  # <Pe>
      "Perl" =~ /er/ and print "<$&>";  # <>
    }
    {
      # disable $& support for complex regexes
      no re::ampersand 'complex';
      "Perl" =~ /../ and print "<$&>";  # <>
      "Perl" =~ /er/ and print "<$&>";  # <er>
    }


DESCRIPTION

When Perl sees you using $`, $&, or $', it has to prepare these variables after every successful pattern match. This can slow a program down because these variables are "prepared" by copying the string you matched against to an internal location. This copying is also how $DIGIT variables are made accessible, but that only occurs on a per-regex basis: if a regex has capturing parentheses, the string will be copied, otherwise it will not be.
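
For comparison, the same three pieces of text can be recovered without naming any of the three variables, using the offset arrays @- and @+ that Perl fills in after a successful match. This is only a sketch of that workaround and is not part of the pragma; the helper name match_parts is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) for $str
# matched against $re, using only the match offsets in @- and @+.
sub match_parts {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    my ($start, $end) = ($-[0], $+[0]);
    return (
        substr($str, 0, $start),              # text before the match
        substr($str, $start, $end - $start),  # the matched text itself
        substr($str, $end),                   # text after the match
    );
}

my @parts = match_parts("complex", qr/.p/);
print "<$parts[0]><$parts[1]><$parts[2]>\n";   # <co><mp><lex>
```

Only the three substrings that are actually asked for get built here, and only for the one match that needs them.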

Simple vs. Complex

Some regexes are simple enough to be matched via the Boyer-Moore substring matching algorithm, a fast approach to finding a substring in a string. Regexes that rely only on constant text and anchors can be matched this way. (These regexes cannot have capturing parentheses.) Because of this, they don't go through the standard regex engine, and end up not preparing $& and its friends -- there is no copying of the string that was matched against.

However, if Perl has seen you using $&, it decides that the simple regex has to go through the engine so it can prepare $&. This means that there is a two-fold slow-down: first, the simple regex has to go through both the Boyer-Moore algorithm and the rest of the regex engine, and second, it has to copy the string that was being matched against.

Ignoring $&

The re::ampersand pragma allows you to ignore the fact that $& (or its friends) has been used in your program. This produces a speed-up in portions of your code that do not need support for $&. This pragma is lexically scoped, which means it takes effect only within the block where you invoke it.
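
To make "lexically scoped" concrete, here is a minimal sketch of how a pure-Perl pragma can record an on/off setting per block, using the %^H hints hash described in perlpragma (Perl 5.10 and later). The package name demo::ampersand and its in_effect function are invented for illustration. The sketch only tracks a flag; it does not change how the regex engine treats $&, which is what the patch to the Perl source has to do.

```perl
use strict;
use warnings;

package demo::ampersand;   # hypothetical stand-in, not the real pragma

# "use demo::ampersand" and "no demo::ampersand" run at compile time and
# store a flag in %^H, which Perl scopes to the block being compiled.
sub import   { $^H{'demo::ampersand'} = 1 }
sub unimport { $^H{'demo::ampersand'} = 0 }

# Read the flag that was in effect where our caller was compiled.
# With no pragma in scope, fall back to "on" (the unchanged default).
sub in_effect {
    my $hints = (caller(0))[10] || {};
    return $hints->{'demo::ampersand'} // 1;
}

package main;

# Mark the module as loaded so use/no do not look for a file on disk.
BEGIN { $INC{'demo/ampersand.pm'} = __FILE__ }

my @states;
push @states, demo::ampersand::in_effect() ? "on" : "off";      # outside
{
    no demo::ampersand;
    push @states, demo::ampersand::in_effect() ? "on" : "off";  # inside
}
push @states, demo::ampersand::in_effect() ? "on" : "off";      # outside again

print "@states\n";   # on off on
```

The setting made by "no" disappears at the closing brace, which is the behaviour the pragma's documentation describes.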

Capturing still works

This module does not turn off capturing support -- if a regex has capturing parentheses in it, you will inadvertently get support for $&, because it is based on the copied string that $1, $2, ... are based on.
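
One way to lean on this, sketched here with made-up variable names: wrap the whole pattern in capturing parentheses and read $1 instead of the match variable, and use (?: ... ) wherever grouping is needed without a capture.

```perl
use strict;
use warnings;

# Wrap the entire pattern in parentheses: $1 then holds the whole match,
# so the code never has to mention the match variable itself.
my $whole;
if ("Perl" =~ /(er)/) {
    $whole = $1;
}
print "<$whole>\n";          # <er>

# (?: ... ) groups without capturing, so only the outer group creates $1.
my ($repeated) = "Perl Perl" =~ /((?:Perl ?)+)/;
print "<$repeated>\n";       # <Perl Perl>
```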


USAGE

Not using this pragma

Your program will run the same way it did before if you do not use this pragma. Default behavior has not been changed.

Turning off $& support

You can turn off support for $& and friends with no re::ampersand, which turns off support for all regexes (unless they have capturing parentheses). If you only want to turn off support for simple regexes, send it the argument 'simple'. If you only want to turn off support for complex regexes, send it the argument 'complex'.

Turning on $& support

Turn on support for $& with use re::ampersand, which turns on support for all regexes. To supply support only to simple regexes, send it the argument 'simple'. To supply support only to complex regexes, send it the argument 'complex'. Again, any regex with capturing parentheses will always have support for $& because of the mechanism that provides $DIGIT variables.


EXAMPLES

Support for $& in a block of a program

  #!/usr/bin/perl -w
  no re::ampersand;
  # simple regex is not weighed down by $&
  "Perl" =~ /..$/ and print "<$&>\n";  # <>
  {
    use re::ampersand;
    "Perl" =~ /^../ and print "<$&>\n";  # <Pe>
  }
  "Perl" =~ /..(?=.$)/ and print "<$&>\n";  # <>

Turning off support for $& in a block

  #!/usr/bin/perl -w
  # regexes set $&
  "Perl" =~ /(?<=.)./ and print "<$&>\n";  # <e>
  {
    no re::ampersand;
    # matching on a string you'd rather not have copied!
    $huge_string =~ /a+bc+/ and print "<$&>\n";  # <>
  }
  # regexes set $&
  "Perl" =~ /.(?!..)/ and print "<$&>\n";  # <r>


AUTHOR

Jeff japhy Pinyan, japhy@pobox.com.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Finally, a $& compromise!
by dws (Chancellor) on Nov 28, 2001 at 07:06 UTC
    Again, any regex with capturing parentheses will always have support for $& because of the mechanism that provides $DIGIT variables.

    Does this mean that a regexp that captures $1 implies $& which implies the performance hit for maintaining $`, $&, and $' ? I thought that you only took on the performance hit if you explicitly used $`, $&, or $'.

      Does this mean that a regexp that captures $1 implies $& which implies the performance hit for maintaining $`, $&, and $' ?
      Only for maintaining them for that regex. The way that $DIGIT variables are supported is thus:
      1. The string being matched against is copied (via savepvn()) to rx->subbeg.
      2. The offsets of the $DIGIT vars are stored in the two arrays rx->startp and rx->endp.
      3. When you access $2, Perl does magic:
        1. It takes the beginning and ending offsets, rx->startp[2] and rx->endp[2], and takes a substring of rx->subbeg.
        2. It savepvn()s (copies) that substring to a scalar and returns it.
      However, this only happens in a regex that has capturing parentheses! If you have a regex that does NOT have capturing parentheses, it does not need to copy the string.
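
The steps above can be modelled in plain Perl. This is an illustration only: the real work happens in C inside the regex engine, and the hash below merely borrows the field names subbeg, startp, and endp from that description. The functions save_match and fetch_digit are hypothetical.

```perl
use strict;
use warnings;

# Steps 1 and 2: after a successful match, keep a copy of the string
# and the start/end offsets of each group (index 0 is the whole match).
sub save_match {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return {
        subbeg => "$str",    # the copied string
        startp => [@-],      # start offset of each group
        endp   => [@+],      # end offset of each group
    };
}

# Step 3: fetching group $n takes a substring of the saved copy.
sub fetch_digit {
    my ($rx, $n) = @_;
    my ($s, $e) = ($rx->{startp}[$n], $rx->{endp}[$n]);
    return unless defined $s && defined $e;
    return substr($rx->{subbeg}, $s, $e - $s);
}

my $rx = save_match("capture", qr/(.t)./);
print "<", fetch_digit($rx, 1), ">\n";   # <pt>
print "<", fetch_digit($rx, 0), ">\n";   # <ptu>
```

Index 0 of the offset arrays describes the whole match, which is why a saved copy that can serve $1 can serve $& as well.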

      The $DIGIT vars are like tiny instances of $& that only appear when you need them. $& appears all the time if you use it once. Here's an example that shows that a regex that uses capturing parentheses gives you the ability to use $& and the like. These are three separate programs. I'm using eval '' so that $& isn't seen at the time the regexes are executed.

      #!/usr/bin/perl
      "simple" =~ /im/ and eval q{ print "<$`><$&><$'>\n" };

      ###

      #!/usr/bin/perl
      "complex" =~ /.p/ and eval q{ print "<$`><$&><$'>\n" };  #<co><mp><lex>

      ###

      #!/usr/bin/perl
      "capture" =~ /(.t)./ and eval q{ print "<$`><$&><$'>:<$1>\n" };  #<ca><ptu><re>:<pt>

      Does that make sense? In order to have $1, you have to have the string that is also used for $&. From perlre:
      WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
      Some of that will be rewritten with the advent of this pragma, though. It's nice to "rewrite the books".

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Finally, a $& compromise!
by TheDamian (Vicar) on Nov 28, 2001 at 11:41 UTC
    Excellent! Well done.

    I don't suppose you could pragmatize $` and $' too, whilst you were at it?

    ;-)

      They're part of the same deal. And I'm currently fixing the patch. Hugo suggests things be done a bit differently, and Jarkko agrees. (The big problem is PL_sawampersand is globally scoped...)

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        The big problem is PL_sawampersand is globally scoped

        Yes, I've been quietly following the thread in P5P. Yet another example of why global variables almost always turn out to be a Very Bad Idea, no matter how clever the programmer or how convenient they may seem at the time.

        Remember kids: Global variables - just say no!

(tye)Re: Finally, a $& compromise!
by tye (Sage) on Nov 28, 2001 at 20:18 UTC

    Neat idea. I don't think it is very practical, tho. The way it is currently designed we'll need to update tons of modules to add "no re::ampersand;" to each file and only then will you be able to use $& in your scripts without the regex in modules being slowed down.

    Similarly, if I want to write a module that uses $&, I can't do it in a way that protects the scripts that use my module from the performance penalty.

    For this to be practical, you need to be able to isolate the penalty of $& to a lexical block. So that I could say:

    { use re::ampersand; $x =~ /.{10}/; $y= $&; }
    and the presence of use re::ampersand would make the penalty of "saw ampersand" go away outside of that block.

    I agree that the lack of re::ampersand should keep the old behavior of global slow-down. But you need to come up with a better way to accomplish this for this neat idea to be of real practical value.

    I guess you could add a new variable, say "expectampersand" and have regex be "slow" in code compiled where "expectampersand" was true or when "sawampersand" was true at run time. And then you would set "sawampersand" whenever you saw one of the three "problem" variables at compile time during a phase when "expectampersand" was false.

    So "sawampersand" would mean that you saw one of the variables outside of a block that did use re::ampersand and a new lexically-scoped compiler hint, "expectampersand", would be added. Noncapturing regexes would be slower within lexical scopes that did use re::ampersand and noncapturing regexes everywhere would be slower if you ever used one of the "problem" variables outside of such a lexical scope.
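
The proposed rule can be written down as a small model. This is only a sketch of the logic described above, not code from the Perl source; the two function names are hypothetical, and the flags are plain Perl variables standing in for the global "sawampersand" and the lexical "expectampersand" hint.

```perl
use strict;
use warnings;

my $sawampersand = 0;   # global flag, as in the proposal

# Compile time: a "problem" variable only sets the global flag when it
# appears outside a scope that declared it expects such variables.
sub note_problem_variable {
    my ($expectampersand) = @_;
    $sawampersand = 1 unless $expectampersand;
}

# Run time: a noncapturing regex takes the slow path if it was compiled
# under the lexical hint, or if the global flag has been set anywhere.
sub regex_is_slow {
    my ($expectampersand) = @_;
    return ($expectampersand || $sawampersand) ? 1 : 0;
}

note_problem_variable(1);            # problem variable inside "use re::ampersand"
my $inside  = regex_is_slow(1);      # slow inside the block
my $outside = regex_is_slow(0);      # still fast everywhere else

note_problem_variable(0);            # problem variable with no pragma in scope
my $after   = regex_is_slow(0);      # now slow everywhere

print "$inside $outside $after\n";   # 1 0 1
```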

    On top of this, you could add your current feature of no re::ampersand so that you could also have some code that does lots of non-capturing regex work that you don't want to be penalized in the face of someone somewhere using a "problem" variable w/o scoping the impact. But I think the other feature is much more important.

    Make sense? Is that along the lines of what was already suggested by Hugo and Jarkko?

    It is kind of like "fence in" vs. "fence out" states in regard to cattle... (:

            - tye (but my friends call me "Tye")

      You brought up the same point as what was suggested by Hugo and Jarkko, and I'm pleased to present a modified version of the pragma. It comes with one caveat: please do not use "no re::ampersand" unless you're aware of the consequences. If you do, then any $& found in the "no re::ampersand" block will set the PL_sawampersand flag for your program. This doesn't sound like a good idea to me, so I advise against using it. I'm still trying to find some way to make PL_sawampersand work like a stack somehow... so that the "no re::ampersand" doesn't leak like that.

      Check out New $& Approach, thanks to Hugo, Jarkko, Tye, and others.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Finally, a $& compromise!
by perrin (Chancellor) on Nov 28, 2001 at 12:04 UTC
    Finally, a $& compromise!

    What the #$%! Who told you it was okay to use that kind of $*@% language here? Do you kiss your camel with that mouth?

Re (tilly) 1: Finally, a $& compromise!
by tilly (Archbishop) on Nov 28, 2001 at 20:18 UTC
    Nice, but after some playing around I can guarantee that you have solved the least of my problems.

    What I really want is conditional support for $` and $'. That is much worse.

    To give a simple example, try the following 3 programs:

    #!/usr/bin/perl
    # This demonstrates matching through a string
    use strict;
    use Time::HiRes qw(gettimeofday);
    my $start = gettimeofday();
    my $str = "_" x $ARGV[0];
    1 while $str =~ /./g;
    my $elapsed = gettimeofday() - $start;
    print "$ARGV[0] characters took $elapsed seconds\n";

    #!/usr/bin/perl
    # This demonstrates matching through a string, capturing
    use strict;
    use Time::HiRes qw(gettimeofday);
    my $start = gettimeofday();
    my $str = "_" x $ARGV[0];
    1 while $str =~ /(.)/g;
    my $elapsed = gettimeofday() - $start;
    print "$ARGV[0] characters took $elapsed seconds\n";

    #!/usr/bin/perl
    # This demonstrates matching through a string, capturing $`
    use strict;
    use Time::HiRes qw(gettimeofday);
    # Mess life up here
    if ("gotcha" =~ /o/) {
      my $fooey = $`;
    }
    my $start = gettimeofday();
    my $str = "_" x $ARGV[0];
    1 while $str =~ /(.)/g;
    my $elapsed = gettimeofday() - $start;
    print "$ARGV[0] characters took $elapsed seconds\n";

    If you try it you will find that the first two versions run linearly, with only a modest speed difference for the capturing. But the third is a quadratic speed drop. Should you ever, as I do, use REs as a way of tokenizing, this means that the use of a single $` or $', anywhere, turns linear algorithms quadratic. By contrast turning $& on and off is not going to change your program's scalability.
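
For context, this is the kind of capture-based tokenizing loop being described, as a minimal sketch; the token classes and the tokenize function are invented for illustration. It walks the string with \G and /gc and reads only $1, never the prematch or postmatch variables.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: splits input into numbers, words, and operators,
# continuing each match where the previous one stopped (\G with /gc).
sub tokenize {
    my ($src) = @_;
    my @tokens;
    while (1) {
        if    ($src =~ m{\G\s+}gc)            { next }
        elsif ($src =~ m{\G(\d+)}gc)          { push @tokens, [NUM  => $1] }
        elsif ($src =~ m{\G([A-Za-z_]\w*)}gc) { push @tokens, [WORD => $1] }
        elsif ($src =~ m{\G([-+*/=])}gc)      { push @tokens, [OP   => $1] }
        elsif ($src =~ m{\G\z}gc)             { last }
        else {
            die "unexpected character at offset " . (pos($src) // 0) . "\n";
        }
    }
    return @tokens;
}

print join(" ", map { "$_->[0]:$_->[1]" } tokenize("x = 40 + 2")), "\n";
# WORD:x OP:= NUM:40 OP:+ NUM:2
```

Every pattern in the loop captures, so this is exactly the case where the pragma as described would not switch off the copying.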

    For that reason if I write something for other people's use, I would really like the option of turning off $` and $' lexically. Because I want access to $1 without being hammered with $` and $'.

      As already mentioned, $` and $' are controlled by the same flag and so the pragma already affects them as well.

              - tye (but my friends call me "Tye")

        As specifically pointed out in the section "Capturing still works", the use of $1 and friends turns support for them back on, despite the pragma.

        I did read what japhy wrote before I posted. While it is nice to turn off the production of capturing strings, the main time I do enough matching that I really care about capturing or not is tokenizing, when I am almost definitely going to be using capturing matches. So this module helps with the problem, but not in the only case where I care.

Re: Finally, a $& compromise!
by BrentDax (Hermit) on Nov 28, 2001 at 06:59 UTC
    *applauds*

    =cut
    --Brent Dax
    There is no sig.
