Re^2: does code help regex match numeric ranges?

I did some benchmarking of this solution, comparing it to a dumb join '|', 0 .. 255.

Update: and benchmarked mwah's solution as well.

The non-backtracking solution really pays off:

              Rate brute_force        mwah     grinder
brute_force  125/s          --        -25%        -92%
mwah         167/s         33%          --        -89%
grinder     1520/s       1114%        812%          --
[download]

However demerphq++'s Trie optimization (I hope I remembered the name correctly) in perl5.10 reduced that advantage significantly - so far, that it becomes faster than the code assertions:

              Rate        mwah brute_force     grinder
mwah         156/s          --        -32%        -87%
brute_force  229/s         47%          --        -81%
grinder     1181/s        659%        416%          --
[download]

Removing the /o-flags from the regexes makes the gap a bit wider.

I used this script for benchmarking:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);


my $grinder = sub {
    for (0 .. 1000){
        if ("$_" =~ m/^(?:2(?:[6-9]|5[0-5]?|[0-4]\d?)?|[3-9]\d?|1(?:\d
+\d?)?|0)$/){
            if ($_ > 255){
                print "Fails for $_\n";
            } 
        } else {
            if ($_ < 255){
                print "Fails for $_\n";
            }
        }
    }
};

my $mwah = sub {
    my $re = qr{
            ((?>\d+))            # what to capture, don't backtrack: (
++?> )
            (??{
                $1<255           # what is looked for
                ? ''             # if yes, let the regex succeed
                : '(?!)'         # if no, let the regex bail
            }) 
          }x;
    for (0 .. 1000){
        if ("$_" =~ m/^$re$/){
            if ($_ > 255){
                print "Fails for $_\n";
            } 
        } else {
            if ($_ < 255){
                print "Fails for $_\n";
            }
        }
    }
};

my $bruteforce = sub {
    my $re = join '|', 0 .. 255;
#    print $re, "\n";
    for (0 .. 1000){
        if ("$_" =~ m{^(?:$re)$}){
            if ($_ > 255){
                print "Brute force Fails for $_\n";
            } 
        } else {
            if ($_ < 255){
                print "Brute force Fails for $_\n";
            }
        }
    }
};

cmpthese(-3, {
        grinder        => $grinder,
        mwah        => $mwah,
        brute_force    => $bruteforce,
        });
[download]

Comment on Re^2: does code help regex match numeric ranges? Select or Download Code

Replies are listed 'Best First'.
Re^3: does code help regex match numeric ranges? by mwah (Hermit) on Nov 04, 2007 at 13:10 UTC
moritz I used this script for benchmarking: Thanks for doing this work already (I started to think about that too;-) As it looks, the dynamic code assertion is somehow slow in this context. I suspected this but it'd be interesting if this disadvantage starts somewhere to disappear eventually. I 'compacted' your benchmark code a bit and tried to expand the context, eg. to match on all numbers from 0..10,000 which are below 2,567. This was easily made into the bruteforce and mwah variants but I failed to put this (working) into the 'grinder-sub' (maybe somebody can try). On a larger range, it's clear that the bruteforce-sub approach starts to fail. The beauty of the dynamic code assertion (its slowness will somehow 'disappear' on larger ranges) is the expressiveness which is somehow in contrast to the 'reinvention' of number parser (grinder-sub). Of course, if the problematic range is constant and the developer is able to provide a non-backtracking 'grinder-like' solution, this can't be beaten by anything. This is my shortened variant (w/grinder defunct) version of your benchmark code: ... use Benchmark qw(:all); our $bruteforce_re = 0; # give it the extra preparation bonus my $grinder = sub { my $fails = 0; for( 0 .. 10000 ) { ++$fails if ! /^(?:2(?:[6-9]\|5[0-5]?\|[0-4]\d?)?\|[3-9]\d?\|1(?:\d\ +d?)?\|0)$/o } die "grinder $fails:" if $fails != 10000-2566 # did we get this +right? }; my $mwah = sub { my $fails = 0; my $re = qr{^(\d+)$(??{$1<=2566?'':'(?!)'})}x; for( 0 .. 10000 ) { ++$fails if ! /$re/ } die "mwah $fails:" if $fails != 10000-2566 # did we get this +right? }; my $bruteforce = sub { $bruteforce_re = '^(?:' . join('\|', 0 .. 2566) . ')$' unless $brute +force_re; my $fails = 0; for( 0 .. 10000 ) { ++$fails if ! /$bruteforce_re/o } die "bruteforce $fails:" if $fails != 10000-2566 # did we get this +right? }; cmpthese(-3, {# grinder => $grinder, mwah => $mwah, brute_force => $bruteforce, }); ... [download] In this range, the dynamic code assertion is already 5 times faster (here) than the bruteforce approach. Regards mwa	[reply] [d/l]

Replies are listed 'Best First'.

Re^3: does code help regex match numeric ranges?
by mwah (Hermit) on Nov 04, 2007 at 13:10 UTC

moritz

I used this script for benchmarking:

Thanks for doing this work already (I started to think about that too;-)

As it looks, the dynamic code assertion *is* somehow slow in this context. I suspected this but it'd be interesting if this disadvantage starts somewhere to disappear eventually.

I 'compacted' your benchmark code a bit and tried to expand the context, eg. to match on all numbers from 0..10,000 which are below 2,567.

This was easily made into the bruteforce and mwah variants but I failed to put this (working) into the 'grinder-sub' (maybe somebody can try). On a larger range, it's clear that the bruteforce-sub approach starts to fail. The beauty of the dynamic code assertion (its slowness will somehow 'disappear' on larger ranges) is the expressiveness which is somehow in contrast to the 'reinvention' of number parser (grinder-sub). Of course, if the problematic range is constant and the developer is able to provide a non-backtracking 'grinder-like' solution, this can't be beaten by anything.

This is my shortened variant (w/grinder defunct) version of your benchmark code:

...
use Benchmark qw(:all);
our $bruteforce_re = 0; # give it the extra preparation bonus

my $grinder = sub {
   my $fails = 0;
   for( 0 .. 10000 ) {
      ++$fails if ! /^(?:2(?:[6-9]|5[0-5]?|[0-4]\d?)?|[3-9]\d?|1(?:\d\
+d?)?|0)$/o
   }
   die "grinder $fails:" if $fails != 10000-2566    # did we get this 
+right?
};

my $mwah = sub {
   my $fails = 0;
   my $re = qr{^(\d+)$(??{$1<=2566?'':'(?!)'})}x;
   for( 0 .. 10000 ) {
      ++$fails if  ! /$re/
   }
   die "mwah $fails:" if $fails != 10000-2566       # did we get this 
+right?
};

my $bruteforce = sub {
   $bruteforce_re = '^(?:' . join('|', 0 .. 2566) . ')$' unless $brute
+force_re;
   my $fails = 0;
   for( 0 .. 10000 ) {
      ++$fails if  ! /$bruteforce_re/o
   }
   die "bruteforce $fails:" if $fails != 10000-2566 # did we get this 
+right?
};

cmpthese(-3, {# grinder     => $grinder,
                mwah        => $mwah,
                brute_force => $bruteforce,
             });
...
[download]

In this range, the dynamic code assertion is already 5 times faster (here) than the bruteforce approach.

Regards

mwa

[reply]
[d/l]