in reply to Hash/Array of Regular Expressions?

I use lists of hashes as fast "objects" in parsers a lot. I like compiled regexes in this case, because the way I use them I am putting changed versions in a new slot anyway. To do this I use the quote regex operator, qr//

my @parsers; sub add_parser { my $name = shift; my $re = shift; my %parser = ( name => $name, re => qr/$re/, code => sub { print "Hi, I'm parser $name, who are you?" }, ); push @parsers, \%parser; } add_parser( "HelloWorld", "foo" ); # ... my $find_this = "ffoobar"; foreach my $parser ( @parsers ) { if ( $find_this =~ $parser->{re} ) { &{$parser->{code}}; } }
It gets ugly really fast, though... real OO is "better," except when its too slow, like if you have hundreds of parsers. Though the whole thing could be hidden behind a object, but that's a religious war I'm not qualified to fight; I like it both ways.
--
Snazzy tagline here

Replies are listed 'Best First'.
Re: Re: Hash/Array of Regular Expressions?
by bikeNomad (Priest) on Jun 24, 2001 at 23:14 UTC
    One way you can speed up matching on lots of regular expressions (assuming that matching is infrequent) is to combine them with alternation. This cuts down on the number of matches you need to do. This can improve the speed lots (over 12 times as fast here, with 91 matches in 45424 lines of text):
    $ perl multire.pl
    Benchmark: running using alternates, using foreach, each for at least 30 CPU seconds...
    using alternates: 30 wallclock secs (30.07 usr +  0.09 sys = 30.16 CPU) @  2.69/s (n=81)
    using foreach: 30 wallclock secs (30.09 usr +  0.02 sys = 30.11 CPU) @  0.20/s (n=6)
                     s/iter    using foreach using alternates
    using foreach      5.02               --             -93%
    using alternates  0.372            1248%               --
    

    Here's the code:

    #!/usr/bin/perl -w use strict; use Benchmark; #$ wc /usr/share/dict/words # 45424 45424 409276 /usr/share/dict/words my $file = '/usr/share/dict/words'; # These callbacks could do whatever you want. # They have the matching line passed in. sub foundThe { } sub foundAeig { } sub foundCat { } my %specs = ( '^the' => \&foundThe, 'at[ei]g' => \&foundAeig, 'cat$' => \&foundCat, ); my %cookedSpecs = map { qr{$_}, $specs{$_} } keys(%specs); sub usingForeach { $count = 0; open FILE, $file or die "can't open $file: $!\n"; while (<FILE>) { foreach my $pattern ( keys(%cookedSpecs) ) { if (m{$pattern}) { $cookedSpecs{$pattern}->($_); } } } close FILE; } sub usingAlternates { my $altpattern = join ( '|', map {qr{$_}} keys(%specs) ); $count = 0; open FILE, $file or die "can't open $file: $!\n"; while (<FILE>) { if (m{$altpattern}) { foreach my $pattern ( keys(%cookedSpecs) ) { if (m{$pattern}) { $cookedSpecs{$pattern}->($_); } } } } close FILE; } Benchmark::cmpthese( -30, { 'using foreach' => \&usingForeach, 'using alternates' => \&usingAlternates } );

    update: clarified callbacks: they don't have to do the same thing.

      True, one regex is faster than a hash of regexes... certainly any case that can be accomplished with one shouldn't be squeezed into a data structure. But, I think it is bad to have a hash with closures as values where the closures are repeated and required to be the same... better might be to stick the regexes in an array in that case, and just use $count++ or a regular function instead of anonymous functions. Once you get it distilled that way, to whatever simplicity your problem domain allows, I suspect you'll almost always be able to pick one or the other based on simplicity...

      OTOH, a 6 mile regex could be tough to debug... what would be nice would be a module that would let you put in regexes, and tell it to do the matching one way or the other, that way if you've got a lot of them you can switch to the looping mode when you need to find the broken bits.
      --
      Snazzy tagline here

        Sorry to mislead you; the callbacks don't have to do the same thing. I just had them that way so I could verify that the code worked and get the count of matching lines. I've changed it to use different callbacks.

        I suspect that a matching module like you describe could be adaptive: it could change its strategy depending on the hit/miss ratio and the number of regexes it had to match.