in reply to fast greedy regex

You have fixed width fields in the date & time, so unpack is worth a look:

my $string = '2004-05-13 14:02:00 blah blah'; my ($year, $month, $day, $hour, $min, $sec, $rest) = unpack 'a4 x a2 x a2 x a2 x a2 x a2 x a*', $string;

After Compline,
Zaxo

Replies are listed 'Best First'.
Re^2: fast greedy regex
by BrowserUk (Patriarch) on Jun 07, 2004 at 21:36 UTC

    I had the same thought, but unless I am doing something dumb (quite likely:), then strangely it seems that unpack is slower than even the explicit regex?

    #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1+rand( 9 )) } 1 .. 1000; cmpthese( -1, { greedy => q[ my( $date, $time, $text ); m[(^\S*)\s(\S*)\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 ) # and print "greedy: $date|$time|$text" for @data; ], explicit => q[ my( $date, $time, $text ); m[(^\d{4}\-\d{2}\-\d{2})\s(\d{2}:\d{2}:\d{2})\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 ) # and print "explicit: $date|$time|$text" for @data; ], unpack => q[ my( $date, $time, $text ); ( $date, $time, $text ) = unpack 'A10 x A8 x A*', $_ # and print "unpack: $date|$time|$text" for @data; ], }); __END__ P:\test>362106 Rate unpack explicit greedy unpack 158/s -- -41% -53% explicit 267/s 70% -- -21% greedy 338/s 114% 26% --

    What stupidity am I guilty of?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Strange indeed. I get similar results, although with smaller differences. However, if I change the tests (but not the regexes or the data) slightly, I do get the results where unpack wins:
      #! perl -slw use strict; use Benchmark qw [cmpthese]; our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1 + rand (9)) } 1 .. 1000; our (@greedy, @explicit, @unpack); cmpthese (-1, { greedy => '@greedy = map {/(^\S*)\s(\S*)\s(.*$)/} @data +', explicit => '@explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data +', unpack => '@unpack = map {unpack "A10 x A8 x A*" => $_} @data +', }); die unless "@greedy" eq "@explicit" && "@greedy" eq "@unpack"; __END__ Rate explicit greedy unpack explicit 86.1/s -- -6% -25% greedy 91.6/s 6% -- -20% unpack 114/s 33% 25% --

      Abigail

        Curiouser and curiouser. The change of tests, changes the relative performance of the methods, but it also slows all of them down.

        I thought that using globals instead of lexicals might have been part of the difference, and it is, but only a small part.

        #! perl -slw use strict; use Benchmark qw[ cmpthese ]; our $TEST ||= 0; our $N = $TEST ? 10 : $N || 1000; our @data = map{ join' ', '2004-05-13', '14:02:00', ('blah') x (1+rand +( 9 )) } 1 .. $N; our (@greedy, @explicit, @unpack); cmpthese( $TEST ? 1 : -1, { our_g => '@greedy = map {/(^\S*)\s(\S*)\s(.*$)/} @data' +, our_e => '@explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data +', our_u => '@unpack = map {unpack "A10 x A8 x A*" => $_} @data' +, my_g => 'my @greedy = map {/(^\S*)\s(\S*)\s(.*$)/} @da +ta', my_e => 'my @explicit = map {/(^\d{4}\-\d{2}\-\d{2})\s (\d{2}:\d{2}:\d{2})\s(.*$)/x} @data +', my_u => 'my @unpack = map {unpack "A10 x A8 x A*" => $_} @da +ta', greedy => q[ my( $date, $time, $text ); m[(^\S*)\s(\S*)\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 ) # and $TEST and print "greedy: $date|$time|$text" for @data; ], explicit => q[ my( $date, $time, $text ); m[(^\d{4}\-\d{2}\-\d{2})\s(\d{2}:\d{2}:\d{2})\s(.*$)] and ( $date, $time, $text ) = ( $1, $2, $3 ) # and $TEST and print "explicit: $date|$time|$text" for @data; ], unpackA => q[ use bytes; my( $date, $time, $text ); ( $date, $time, $text ) = unpack 'A10 x A8 x A*', $_ # and $TEST and print "unpackA: $date|$time|$text" for @data; ], substr => q[ use bytes; my( $date, $time, $text ); ( $date, $time, $text ) = ( substr( $_, 0, 10 ), substr( $_, 11, 8 ), substr( $_, 20 ) ) # and $TEST and print "substr: $date|$time|$text" for @data; ], }); __END__ P:\test>362106 Rate our_e our_g our_u my_e my_g my_u unpackA substr expli +cit greedy our_e 72.4/s -- -2% -15% -28% -30% -43% -55% -73% - +77% -79% our_g 73.6/s 2% -- -13% -27% -29% -42% -54% -73% - +77% -79% our_u 85.0/s 17% 15% -- -16% -18% -33% -47% -69% - +73% -75% my_e 101/s 39% 37% 19% -- -3% -20% -37% -63% - +68% -71% my_g 104/s 43% 41% 22% 3% -- -18% -35% -62% - +67% -70% my_u 126/s 74% 71% 48% 25% 21% -- -21% -53% - +60% -64% unpackA 160/s 121% 117% 88% 59% 54% 27% -- -41% - +49% -54% substr 270/s 273% 267% 218% 168% 160% 114% 69% -- - +14% -22% explicit 314/s 334% 327% 270% 212% 203% 149% 96% 16% + -- -9% greedy 346/s 378% 370% 307% 243% 234% 175% 116% 28% +10% --

        It would be interesting to see the benchmark run on 5.6.2 (pre-unicodification), which I don't have installed currently.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
      I was going to suggest substr() (which is around twice as fast as any of these to extract just these three pieces of data, perfectly formatted) - until I saw the real regex being used. I can't imagine substr, unpack or any rigidly formatted extraction method is any use at all for that lot.

        The thing that really queers the pitch is the quoted field 2/3rds of the way down each line. Without that, you could use the magic form of split and it would (probably) be quicker. As it is, I can't see any way of improving the performance over the long, but actually quite fast regex.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re^2: fast greedy regex
by grinder (Bishop) on Jun 08, 2004 at 11:20 UTC
    unpack is worth a look

    Sometimes, sometimes not. Unpack still has to reparse the format string each time it is called (unless the results are cached internally, but last time I looked they weren't). This cost can add up, which is why substr sometimes outperforms it.


    One other remark to the OP:

    Matching a user-agent string with /(\".*\")/ is pretty horrendous, but there's not much you can do, given that there are some more-or-less malicious useragent strings that contain " themselves. When I ran into this problem years ago the only elegant way I found to deal with it reliably was to walk forwards up to the opening double quote isolating the various fields, walk backwards from the end of the string isolating the other fields (the Referrer if memory serves correctly) and what remained was the User agent.

    This can be done nicely with

    $front = substr( $_, 0, index($_, '"' ) - 1, '' ); $back = substr( $_, rindex( $_, '"' ) + 1, '' ); $user_agent = $_; # modulo a quote or space or two

    The above fragment is non-tested and may contain a fencepost error, but you get the idea. Once this is out of the way it should be possible to construct non-backtracking regexps to match what's left in $front and $back.

    - another intruder with the mooring of the heat of the Perl