Re: Re: Hash/Array of Regular Expressions?

One way you can speed up matching on lots of regular expressions (assuming that matching is infrequent) is to combine them with alternation. This cuts down on the number of matches you need to do. This can improve the speed lots (over 12 times as fast here, with 91 matches in 45424 lines of text):

$ perl multire.pl
Benchmark: running using alternates, using foreach, each for at least 30 CPU seconds...
using alternates: 30 wallclock secs (30.07 usr +  0.09 sys = 30.16 CPU) @  2.69/s (n=81)
using foreach: 30 wallclock secs (30.09 usr +  0.02 sys = 30.11 CPU) @  0.20/s (n=6)
                 s/iter    using foreach using alternates
using foreach      5.02               --             -93%
using alternates  0.372            1248%               --

Here's the code:

#!/usr/bin/perl -w
use strict;
use Benchmark;

#$ wc /usr/share/dict/words
#  45424   45424  409276 /usr/share/dict/words
my $file = '/usr/share/dict/words';

# These callbacks could do whatever you want.
# They have the matching line passed in.
sub foundThe { }
sub foundAeig { }
sub foundCat { }

my %specs = (
    '^the'    => \&foundThe,
    'at[ei]g' => \&foundAeig,
    'cat$'    => \&foundCat,
);

my %cookedSpecs = map { qr{$_}, $specs{$_} } keys(%specs);

sub usingForeach
{
    $count = 0;
    open FILE, $file or die "can't open $file: $!\n";
    while (<FILE>)
    {
        foreach my $pattern ( keys(%cookedSpecs) )
        {
            if (m{$pattern})
            {
                $cookedSpecs{$pattern}->($_);
            }
        }
    }
    close FILE;
}

sub usingAlternates
{
    my $altpattern = join ( '|', map {qr{$_}} keys(%specs) );
    $count = 0;
    open FILE, $file or die "can't open $file: $!\n";

    while (<FILE>)
    {
        if (m{$altpattern})
        {
            foreach my $pattern ( keys(%cookedSpecs) )
            {
                if (m{$pattern})
                {
                    $cookedSpecs{$pattern}->($_);
                }
            }
        }
    }
    close FILE;
}

Benchmark::cmpthese(
    -30,
    {
        'using foreach'    => \&usingForeach,
        'using alternates' => \&usingAlternates
    }
);
[download]

update: clarified callbacks: they don't have to do the same thing.

Comment on Re: Re: Hash/Array of Regular Expressions? Download Code

Replies are listed 'Best First'.
Re: Re: Re: Hash/Array of Regular Expressions? by Aighearach (Initiate) on Jun 25, 2001 at 01:01 UTC
True, one regex is faster than a hash of regexes... certainly any case that can be accomplished with one shouldn't be squeezed into a data structure. But, I think it is bad to have a hash with closures as values where the closures are repeated and required to be the same... better might be to stick the regexes in an array in that case, and just use $count++ or a regular function instead of anonymous functions. Once you get it distilled that way, to whatever simplicity your problem domain allows, I suspect you'll almost always be able to pick one or the other based on simplicity... OTOH, a 6 mile regex could be tough to debug... what would be nice would be a module that would let you put in regexes, and tell it to do the matching one way or the other, that way if you've got a lot of them you can switch to the looping mode when you need to find the broken bits. -- Snazzy tagline here	[reply]
Re: Re: Re: Re: Hash/Array of Regular Expressions? by bikeNomad (Priest) on Jun 25, 2001 at 01:50 UTC
Sorry to mislead you; the callbacks don't have to do the same thing. I just had them that way so I could verify that the code worked and get the count of matching lines. I've changed it to use different callbacks. I suspect that a matching module like you describe could be adaptive: it could change its strategy depending on the hit/miss ratio and the number of regexes it had to match.	[reply]
Re: Re: Re: Re: Re: Hash/Array of Regular Expressions? by Aighearach (Initiate) on Jun 25, 2001 at 02:05 UTC
I wasn't really meaning the actual text of your example, as it's just an example, I'm just saying that in this case whether you actually need the callbacks to be different is what determines the best data structure to use. -- Snazzy tagline here	[reply]