irom has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to take a series of regular expressions and store them in an array or hash? If so, how would I use this to create some sort of quick grep sanity test to see if a string matches any of the REs stored in the list? A good application for this would be to eliminate garbage input when calling a Perl script or CGI. Thanks, Irom

Re: Hash/Array of Regular Expressions?
by bikeNomad (Priest) on Jun 23, 2001 at 20:52 UTC
    You can pre-compile a regular expression using the qr// quote-like operator:
    #!/usr/bin/perl -w
    use strict;

    my @array = map { qr{$_} } ('^abcd', 'cd[ef]g', 'cat$');

    while (<>) {
        chomp;
        foreach my $re (@array) {
            print "Matched $re\n" if m{$re};
        }
    }
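
For the "does this string match any of the stored patterns" test the question asks about, a minimal sketch might look like the following. The helper name `matches_any` and the sample inputs are assumptions made for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string matches at least one compiled pattern.
sub matches_any {
    my ($string, @patterns) = @_;
    for my $re (@patterns) {
        return 1 if $string =~ $re;
    }
    return 0;
}

# Same pattern list as above, compiled once with qr//.
my @allowed = map { qr{$_} } ('^abcd', 'cd[ef]g', 'cat$');

for my $input ('abcdxyz', 'tomcat', 'garbage') {
    printf "%-8s %s\n", $input, matches_any($input, @allowed) ? 'ok' : 'rejected';
}
```

Returning on the first hit avoids testing the remaining patterns once the input is known to be acceptable.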
        my @array = map { qr{$_} } ('^abcd', 'cd[ef]g', 'cat$');
      I'm curious why you chose to use map here. I've used arrays of qr!! (Parse Loops with flat text files. (code)), and I also think I have a good sense of when map is appropriate (Sort a long list of hosts by domain (code)). But in some cases, using map seems to be either obfuscation or bloat for the sake of shorter code.

      Observe:

      # timtowtdi...
      foreach ( qw{ ^abcd cd[ef]g cat$ } ) { push @array, qr{$_} };

      # shorter still ...
      push @array, qr{$_} for qw{ ^abcd cd[ef]g cat$ };

      # or the way I think is most clear and most sane....
      @array = ( qr{^abcd}, qr{cd[ef]g}, qr{cat$} );

      Isn't there additional overhead from calling map? Some people are frenzied map lovers. And in some cases, it is the most appropriate way to do something. I just don't see why you used it here.

      Enlighten me?

      brother dep.

      --
      Laziness, Impatience, Hubris, and Generosity.

        Good question. I agree that sometimes map obscures things. I find it obscuring when it's used in void context (instead of a foreach loop), but I also find that using a foreach loop where a mapping is happening obscures things.

        I used map because:

      • I assumed that the patterns weren't necessarily going to be in the program text (ruling out your literal
        @array = ( qr{^abcd}, qr{cd[ef]g}, qr{cat$} );
        option)
      • To me, the operation reads more clearly this way: I'm transforming an array of strings into an array of regexes by applying an operation to each of them. This is communicated most clearly (IMO) by the map operator. My problem with the foreach is that it tends to obscure the meaning of the code. We see a loop, then we have to decode it to figure out that it is in fact doing the same thing as a map. Kent Beck would call using map intention revealing.
      • I mostly use Smalltalk, where this is the idiomatic way to do it.

        In Smalltalk, this operation would be written as:
        regexes := strings collect: [ :ea | Regex new: ea ].
        In Smalltalk, every collection responds to the collect: message, which passes each of the elements of the collection into a block (equivalent to a Perl CODE ref) whose output is collected into a collection of the same species as the original collection. So Perl's map operator corresponds directly to Smalltalk's collect: methods.

        Also, Perl's grep operator corresponds directly to Smalltalk's select: methods.
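
To make the correspondence concrete, here is a small sketch of the two operators side by side. The variable names and the "anchored at the start" test are assumptions chosen only for illustration.

```perl
use strict;
use warnings;

my @strings = ('^abcd', 'cd[ef]g', 'cat$');

# map: transform every element into something new (Smalltalk's collect:).
my @regexes = map { qr{$_} } @strings;

# grep: keep only the elements that pass a test (Smalltalk's select:).
# Here the test is "the pattern text starts with a literal ^".
my @anchored = grep { /^\^/ } @strings;

print scalar(@regexes), " regexes compiled\n";
print "anchored at start: @anchored\n";
```

In both cases the result list is actually used, which is the situation where map and grep read more clearly than an explicit loop.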

        update: changed title because of topic change

        Since you've called me out by name, I guess I have to respond. And I wouldn't call myself a "frenzied map lover". Like any function, map has its place.

        When is the proper time to use map? I would argue that one should use map whenever you want to apply an expression to all the members of an array, and actually intend to use the array of results.

        Map is contraindicated when you're not going to use the result... use foreach in that case; map adds extra overhead compared to foreach if you aren't going to use the result. (that's the only additional overhead that I know of.) In this case, bikeNomad is using the result, and the code is therefore concise, and correct.

        Dep, you haven't shown any reason why map is a bad idea in this case. I think that clarity here is achieved by separating out the array of regexps, so that they are visually distinct and clear. Then, the code that maps that with qr() is also visually separate. I think this is a good thing.

        Since you implied you wanted it, here's my stylistic criticism of your alternatives:

        @array = ( qr{^abcd}, qr{cd[ef]g}, qr{cat$} );

        Putting a qr{} around each search term is terrible, IMHO. If you had a list with many search terms, it would result in much more typing. Even with a few terms, it means that each search expression that the author is trying to express is wrapped in a little bit of ugliness. (I do appreciate your use of qw{} elsewhere, to reduce quotes.)

        foreach ( qw{ ^abcd cd[ef]g cat$ } ) { push @array, qr{$_} }

        This isn't bad, but recommending it is the same as saying that map() shouldn't exist, since it's exactly the same, except with more typing.

         push @array, qr{$_} for qw{ ^abcd cd[ef]g cat$ };

        This is worst of all, I think, because it relies on the weird semantic order of things in perl that few other languages implement (like putting the loop conditions after the loop body). Don't get me wrong, I think that sort of thing is cool, and is great for some circumstances. But really, the point of the backwards syntax is to make perl read more like English. I'd rather my perl code read like C or TCL or lisp than English. Those types of constructs are exactly the sorts of things that make perl hard to read for novice perl programmers. The fact that I can iterate the push after-the-fact like you suggest here is non-obvious to someone coming from another language. Even someone familiar with perl might wonder whether the precedence rules will do what you want. As it turns out, your code is correct, of course. But someone could easily read it as meaning something like: push @array, { qr{$_} } for qw{ ^abcd cd[ef]g cat$ };, which, of course, is wrong.

        My own stylistic fetishes aside, you never said what you thought was wrong with using map. What is it that you object to?

        Map has lower overhead than many other list changing algorithms... this is mostly because it uses better, faster, fewer temporary variables. Let's compare using our good old friend Devel::OpProf.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Devel::OpProf qw'profile print_stats zero_stats';

        my @source = ( 1..10_000 );
        my @dest = ();

        # measure the map
        profile(1);
        @dest = map { $_ * 10 } @source;
        profile(0);
        print "*** map ***\n";
        print_stats();
        zero_stats();
        @dest = ();

        # measure the foreach
        # (note: this assigns to @dest on every pass rather than pushing,
        # so @dest ends up holding only the last value; the profile in the
        # output reflects that list assignment)
        profile(1);
        foreach (@source) {
            @dest = $_ * 10;
        }
        profile(0);
        print "\n*** foreach ***\n";
        print_stats();
        zero_stats();
        @dest = ();

        # measure the for
        profile(1);
        push @dest, $_ * 10 for @source;
        profile(0);
        print "\n*** for ***\n";
        print_stats();

        The output:

        *** map ***
        null operation           10005
        constant item            10001
        scalar variable          10000
        map iterator             10000
        multiplication (*)       10000
        block                    10000
        pushmark                 4
        next statement           2
        private array            2
        list assignment          1
        map                      1
        subroutine entry         1
        glob value               1
        
        *** foreach ***
        null operation           20005
        pushmark                 20002
        next statement           20002
        glob value               10002
        logical and (&&)         10001
        private array            10001
        constant item            10001
        foreach loop iterator    10001
        iteration finalizer      10000
        multiplication (*)       10000
        scalar dereference       10000
        list assignment          10000
        foreach loop entry       1
        subroutine entry         1
        loop exit                1
        
        *** for ***
        next statement           10003
        glob value               10002
        pushmark                 10002
        logical and (&&)         10001
        private array            10001
        constant item            10001
        foreach loop iterator    10001
        multiplication (*)       10000
        push                     10000
        iteration finalizer      10000
        scalar dereference       10000
        null operation           5
        foreach loop entry       1
        subroutine entry         1
        loop exit                1
        

        So we see that map executes fewer ops than the foreach block, and that push with a trailing for is almost as good as map, with many of the same operations going on.
        --
        Snazzy tagline here

Re: Hash/Array of Regular Expressions?
by tadman (Prior) on Jun 23, 2001 at 21:50 UTC
    A string is a regular expression of sorts, so you could simply extract these from a hash and test against them. However, each time through you will have to recompile the pattern, which can be slow going if you do this a lot. Using the qr operator can help a bit, or using the /o switch to prevent recompilation, but you might be better off using a hash of subs which just happen to contain regular expressions.

    Here's something that demonstrates my idea:
    my %validate = (
        int  => sub { my($v)=@_; $v=~/^[0-9]+$/; },
        date => sub { my($v)=@_; $v=~/^[12]\d\d\d\-[01]?\d\-[0123]?\d/; },
    );

    foreach my $value (qw [ 24 140 510 2001-04-14 3014-30-55 ]) {
        print "$value\n";
        foreach my $type (sort keys %validate) {
            print "\tIs '$type'\n" if ($validate{$type}($value));
        }
    }

    If you have many, many different patterns to define, you might want to eval them into the hash, like so:
    $validate{$new_type} = eval "sub { my(\$v)=\@_; \$v=~$regex; }";
    Of course, you need to take special care to ensure $regex is a self-contained regex (i.e. /x/ or !/!) and does not contain anything that will be invalid when eval'd, though you can always check $@ to see what went wrong.
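
A sketch of the eval approach with the $@ check spelled out follows. The helper name `add_validator` and the sample regexes are assumptions for illustration, and a string eval should only ever be fed trusted pattern text, since it runs arbitrary code.

```perl
use strict;
use warnings;

my %validate;

# Hypothetical helper: compile a validator from a regex given as source
# text, e.g. '/^[0-9]+$/'. Returns 1 on success, 0 if the eval failed.
sub add_validator {
    my ($type, $regex) = @_;
    my $code = eval "sub { my(\$v)=\@_; \$v=~$regex; }";
    if (!$code) {
        warn "Could not compile validator '$type': $@";
        return 0;
    }
    $validate{$type} = $code;
    return 1;
}

add_validator(int => '/^[0-9]+$/');
add_validator(bad => '/^(unclosed/');    # broken regex: rejected, reported via $@

print "42 is an int\n" if $validate{int}->('42');
```

Only validators that compiled cleanly end up in the hash, so a bad pattern fails when it is registered instead of when it is first used.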

    The advantage to using a full sub over just a regex is that you can validate in a context outside of a regex just to be sure. For example, you could check that the day of the month was actually 31 or less, instead of possibly 39.
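
For instance, a date validator along these lines could use the regex for the overall shape and ordinary code for the ranges. This is only a sketch: the name `$is_date` is made up here, and it does not check month lengths or leap years.

```perl
use strict;
use warnings;

# Hypothetical 'date' validator: regex for the shape, plain code for ranges.
my $is_date = sub {
    my ($v) = @_;
    my ($year, $month, $day) = $v =~ /^([12]\d\d\d)-([01]?\d)-([0123]?\d)$/
        or return 0;
    return 0 if $month < 1 || $month > 12;
    return 0 if $day   < 1 || $day   > 31;
    return 1;
};

for my $value (qw[ 2001-04-14 3014-30-55 2001-13-01 2001-02-39 ]) {
    printf "%-12s %s\n", $value, $is_date->($value) ? 'valid' : 'invalid';
}
```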
Re: Hash/Array of Regular Expressions?
by Aighearach (Initiate) on Jun 23, 2001 at 23:55 UTC

    I use lists of hashes as fast "objects" in parsers a lot. I like compiled regexes in this case, because the way I use them I am putting changed versions in a new slot anyway. To do this I use the quote regex operator, qr//

    my @parsers;

    sub add_parser {
        my $name = shift;
        my $re = shift;
        my %parser = (
            name => $name,
            re   => qr/$re/,
            code => sub { print "Hi, I'm parser $name, who are you?" },
        );
        push @parsers, \%parser;
    }

    add_parser( "HelloWorld", "foo" );
    # ...
    my $find_this = "ffoobar";
    foreach my $parser ( @parsers ) {
        if ( $find_this =~ $parser->{re} ) {
            &{$parser->{code}};
        }
    }

    It gets ugly really fast, though... real OO is "better," except when it's too slow, like if you have hundreds of parsers. Though the whole thing could be hidden behind an object, but that's a religious war I'm not qualified to fight; I like it both ways.
    --
    Snazzy tagline here
      One way you can speed up matching on lots of regular expressions (assuming that matching is infrequent) is to combine them with alternation. This cuts down on the number of matches you need to do. This can improve the speed lots (over 12 times as fast here, with 91 matches in 45424 lines of text):
      $ perl multire.pl
      Benchmark: running using alternates, using foreach, each for at least 30 CPU seconds...
      using alternates: 30 wallclock secs (30.07 usr +  0.09 sys = 30.16 CPU) @  2.69/s (n=81)
      using foreach: 30 wallclock secs (30.09 usr +  0.02 sys = 30.11 CPU) @  0.20/s (n=6)
                       s/iter    using foreach using alternates
      using foreach      5.02               --             -93%
      using alternates  0.372            1248%               --
      

      Here's the code:

      #!/usr/bin/perl -w
      use strict;
      use Benchmark;

      #$ wc /usr/share/dict/words
      #  45424  45424 409276 /usr/share/dict/words
      my $file = '/usr/share/dict/words';
      my $count = 0;    # declared here so the script compiles under strict

      # These callbacks could do whatever you want.
      # They have the matching line passed in.
      sub foundThe  { }
      sub foundAeig { }
      sub foundCat  { }

      my %specs = (
          '^the'    => \&foundThe,
          'at[ei]g' => \&foundAeig,
          'cat$'    => \&foundCat,
      );

      my %cookedSpecs = map { qr{$_}, $specs{$_} } keys(%specs);

      sub usingForeach {
          $count = 0;
          open FILE, $file or die "can't open $file: $!\n";
          while (<FILE>) {
              foreach my $pattern ( keys(%cookedSpecs) ) {
                  if (m{$pattern}) {
                      $cookedSpecs{$pattern}->($_);
                  }
              }
          }
          close FILE;
      }

      sub usingAlternates {
          my $altpattern = join ( '|', map {qr{$_}} keys(%specs) );
          $count = 0;
          open FILE, $file or die "can't open $file: $!\n";
          while (<FILE>) {
              # one cheap test first; only matching lines get the full loop
              if (m{$altpattern}) {
                  foreach my $pattern ( keys(%cookedSpecs) ) {
                      if (m{$pattern}) {
                          $cookedSpecs{$pattern}->($_);
                      }
                  }
              }
          }
          close FILE;
      }

      Benchmark::cmpthese(
          -30,
          {
              'using foreach'    => \&usingForeach,
              'using alternates' => \&usingAlternates
          }
      );

      update: clarified callbacks: they don't have to do the same thing.

        True, one regex is faster than a hash of regexes... certainly any case that can be accomplished with one shouldn't be squeezed into a data structure. But, I think it is bad to have a hash with closures as values where the closures are repeated and required to be the same... better might be to stick the regexes in an array in that case, and just use $count++ or a regular function instead of anonymous functions. Once you get it distilled that way, to whatever simplicity your problem domain allows, I suspect you'll almost always be able to pick one or the other based on simplicity...

        OTOH, a 6 mile regex could be tough to debug... what would be nice would be a module that would let you put in regexes, and tell it to do the matching one way or the other, that way if you've got a lot of them you can switch to the looping mode when you need to find the broken bits.
        --
        Snazzy tagline here
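
A rough sketch of that "switchable" idea: one constructor that returns either a single combined alternation or a loop that reports which pattern matched. The function name `make_matcher` and its mode names are invented for this example and do not refer to any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: build a matcher from pattern strings.
# 'combined' mode tests one big alternation and returns 1 or 0.
# 'loop' mode tries each pattern in turn and returns the text of the
# first pattern that matched (or 0), which helps find the broken bits.
sub make_matcher {
    my ($mode, @patterns) = @_;
    my @compiled = map { qr{$_} } @patterns;

    if ($mode eq 'combined') {
        my $alternation = join '|', map { "(?:$_)" } @patterns;
        my $combined    = qr{$alternation};
        return sub { $_[0] =~ $combined ? 1 : 0 };
    }

    return sub {
        my ($string) = @_;
        for my $i (0 .. $#compiled) {
            return $patterns[$i] if $string =~ $compiled[$i];
        }
        return 0;
    };
}

my @patterns = ('^the', 'at[ei]g', 'cat$');
my $fast  = make_matcher(combined => @patterns);
my $debug = make_matcher(loop     => @patterns);

for my $word (qw(theory tomcat dog)) {
    printf "%-7s combined=%s loop=%s\n", $word, $fast->($word), $debug->($word);
}
```

Wrapping each alternative in (?:...) keeps one pattern's own alternation or anchors from leaking into its neighbours when they are joined with |.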

Re: Hash/Array of Regular Expressions?
by holygrail (Scribe) on Jun 23, 2001 at 20:52 UTC
    You can of course create an array of regexes and use eval to run each element of the array when you set $_ to the thing you want to evaluate with that regex.

    --HolyGrail
Re: Hash/Array of Regular Expressions?
by JohnAndy (Initiate) on Jul 05, 2007 at 16:44 UTC
    OK, so the replies show a number of ways of doing this. Now suppose I wanted to do the same thing, but I wanted to have an array of substitutions (i.e. s/something/something_else/) instead of just matches. Putting s/// in the array doesn't work because it executes the substitution at the point where the array is defined - I guess I need a quoting operator equivalent to s///...
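
One possible approach, sketched under the assumption that either closures or pattern/replacement pairs are acceptable: there is no s/// counterpart to qr//, but the substitution can be wrapped in a sub, or the pattern can be stored with qr// and the replacement kept as a plain string. The sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Option 1: an array of closures. Each one edits its argument in place,
# because the elements of @_ are aliases to the caller's variables.
my @substitutions = (
    sub { $_[0] =~ s/colour/color/g },
    sub { $_[0] =~ s/\s+/ /g },
);

my $text = "the   colour  of colour";
$_->($text) for @substitutions;
print "$text\n";

# Option 2: pairs of compiled pattern and literal replacement string.
my @pairs = (
    [ qr/colour/, 'color' ],
    [ qr/\s+/,    ' '     ],
);

my $other = "a  colour";
for my $pair (@pairs) {
    my ($re, $with) = @$pair;
    $other =~ s/$re/$with/g;
}
print "$other\n";
```

The closure form allows arbitrary replacement logic (including /e-style computation), while the pair form is easier to load from a config file.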