neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

Brothers and Sisters,

Suppose I currently have something like this:

my @regex = map { qr/$_/ } qw/ regex1 regex2 . . regex 140 /;

# @array has 20000 entries
foreach $i (@array) {
    for $x (@regex) {
        push @match, $x if $i =~ m/$x/;
    }
}

There must be a more efficient way to do this. I was thinking it would be better to loop through @regex first:

foreach $x (@regex) {
    @bad_matches = grep ! m/$x/, @array;
}

After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

Am I on the right track? Would using the Parallel::ForkManager somehow help? Does anyone have other suggestions?

Neil Watson
watson-wilson.ca

Replies are listed 'Best First'.
Re: Efficiency: Foreach loop and multiple regexs
by broquaint (Abbot) on Sep 13, 2002 at 14:33 UTC
    A better option might be to join the regex array together as a single regex made up of alternations
    my $regex = do {
        local $" = '|';
        qr/(?:@regexes)/;
    };
    my @matches = grep /$regex/, @array;
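    With invented data, the combined pattern behaves like this (the patterns and strings below are made up purely for illustration):

```perl
my @regexes = ('foo', 'ba.');                 # invented patterns
my @array   = ('foobar', 'bazaar', 'quux');   # invented data

# Join the patterns into one alternation: qr/(?:foo|ba.)/
my $regex = do { local $" = '|'; qr/(?:@regexes)/ };
my @matches = grep /$regex/, @array;
# 'foobar' and 'bazaar' match; 'quux' does not
```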

    HTH

    _________
    broquaint

    update: changed code re neilwatson's clarification

      Or use Regex::PreSuf, which takes such an array and builds an optimised regex from it.
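      Regex::PreSuf is not assumed installed here; as a sketch of the idea, the hand-written pattern below factors out the common prefix the way its presuf() would:

```perl
my @words    = qw(foobar foobaz foonly);
my $naive    = qr/^(?:foobar|foobaz|foonly)$/;
# Roughly the kind of prefix-factored pattern presuf() builds:
my $factored = qr/^foo(?:ba[rz]|nly)$/;

# Both patterns accept exactly the same words; the factored one lets
# the engine share the work of matching the common "foo" prefix.
my @naive_hits    = grep { /$naive/ }    (@words, 'foogle');
my @factored_hits = grep { /$factored/ } (@words, 'foogle');
```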
      --
      Mike
Re: Efficiency: Foreach loop and multiple regexs
by bronto (Priest) on Sep 13, 2002 at 14:51 UTC

    in Advanced Perl Programming it is suggested to create the matching loop on the fly with an eval. That is, if @plain contains the string version of your @regex, then it is suggested to do something like this:

    # UNTESTED CODE BELOW THIS LINE #################
    foreach my $re (@plain) {
        my $cycle = q(foreach my $i (@array) { push @match, $re if $i =~ /)
                  . $re
                  . q(/ });
        eval $cycle;
    }

    This way, each "matching" loop is compiled as soon as it is needed, and you gain some speed because the regexp in the pattern match is not re-interpolated on every iteration.

    Of course, you may not be keen on using eval $scalar, but that's a matter of taste (for many definitions of taste :-)

    Ciao!
    --bronto

    # Another Perl edition of a song:
    # The End, by The Beatles
    END {
      $you->take($love) eq $you->made($love) ;
    }

Re: Efficiency: Foreach loop and multiple regexs
by Zaxo (Archbishop) on Sep 13, 2002 at 15:05 UTC

    Your second proposal is quite different from your first; the results will differ.

    The first will probably benefit from study $i; before the inner loop. With 140 matches to try, it is likely to pay off (but Benchmark it).

    Consider these questions. Do you want all matches, or just the first match? That means the order of regexen matters. Do you need to retain any information about which regex matched a given @array element? That would be the case if this is a parser of some kind. Do you need to maintain the sequence of @array elements?

    Devise a data structure which makes it easy to store and recover what you need to know. Depending on the kind of data in @array, you may want to push indexes or references to elements rather than copying. That will assist in remembering where each element came from, too.

    If the first match is the important one, last will bail out of the inner loop, saving unwanted tests.
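    A minimal sketch of that first-match loop, with invented data (study is harmless here; on modern perls it is a no-op):

```perl
my @array = ('alpha one', 'beta two', 'gamma three');  # invented data
my @regex = (qr/beta/, qr/alpha/, qr/zzz/);            # invented patterns
my @match;

foreach my $i (@array) {
    study $i;                  # hint that $i will be matched repeatedly
    foreach my $x (@regex) {
        if ($i =~ $x) {
            push @match, $i;
            last;              # first match wins; skip the remaining regexes
        }
    }
}
# 'gamma three' matches nothing and is dropped
```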

    After Compline,
    Zaxo

Re: Efficiency: Foreach loop and multiple regexs
by blokhead (Monsignor) on Sep 13, 2002 at 14:39 UTC
    After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

    Are you trying to match only the items in @array that match all of your regexes? If so, then I would probably write the loops with foreach @regex on the outside (since @regex is much smaller). Please correct me if I misinterpreted your question.

    foreach $reg (@regex) {
        my $i = 0;
        while ($i <= $#array) {
            if ($array[$i] =~ /$reg/) {
                # successful, so keep $array[$i] in the array and move to the next
                $i++;
            }
            else {
                # this item didn't pass, so we remove it for good and
                # never have to test it again
                splice(@array, $i, 1);
            }
        }
    }
    Something like this might improve your efficiency -- as soon as an item fails a test, it is removed from @array, so we have less and less to loop through each time. Ideally you would want to organize @regex so that the most restrictive expressions (those that fail on the highest number of items in @array) come first -- that would give you the best efficiency.

    Let me know if this is not what you intended for this code snippet.

    blokhead

      I should have been more clear. I'm trying to match ANY item in @array that matches any regex in @regex.

      Neil Watson
      watson-wilson.ca

        OK, in that case, we can kinda swap logic in the if-statement.
        my @matches;
        foreach $reg (@regex) {
            my $i = 0;
            while ($i <= $#array) {
                if ($array[$i] =~ /$reg/) {
                    # successful, so this was a match! put it somewhere safe;
                    # also remove it from @array, we don't need to check it again
                    push @matches, $array[$i];
                    splice(@array, $i, 1);
                }
                else {
                    # this item didn't pass, so keep it and try again next regex
                    $i++;
                }
            }
        }
        Take successful matches, put them somewhere, and remove them since we don't need to ever check them again. In this scenario, you would want to put the regex which accepts the highest number of items first in @regex, so that more is removed from @array earlier.

        The replies from other monks also are good advice -- combining into one regex is a great idea as well.

        blokhead

Re: Efficiency: Foreach loop and multiple regexs
by jkahn (Friar) on Sep 13, 2002 at 15:50 UTC
    neilwatson said:
    foreach $x (@regex) {
        @bad_matches = grep ! m/$x/, @array;
    }
    After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

    I would suggest that you use the following mechanism (incorporating some ideas from other contributors):

    foreach my $regex (@regexes) {
        @array = grep { ! m/$regex/ } @array;
    }

    If you really need to keep @bad_matches (those that *did* match the regexes) then you probably want to run the loop manually, as somebody else suggested:

    my (@array);    # is populated however you originally did
    my (@keepers);
    my (@bad);

    ITEM:           # label added
    while (@array) {
        my $item = shift @array;
        foreach my $regex (@regexes) {
            if ($item =~ $regex) {
                push @bad, $item;
                next ITEM;  # return to label
            }
        }
        push @keepers, $item;
    }

    # if you want it back on @array, just uncomment:
    # @array = @keepers;

    HTH, -- jkahn

    Update:(09:37 PDT) added ITEM: label and next ITEM line. Whoops. Code will be substantially sub-optimal without them.

      I like this, it is pretty fast and yet readable:
      #!/usr/local/bin/perl
      my @regexs = ('test12','test2','test3','test4');
      my $filename = '/var/tmp/testdata';
      my @compregex = ();
      my (%match, %nomatch) = ();

      # Pre-compile regexes, sorting smaller to larger
      # (shorter prolly will match more than a longer =)
      foreach my $regex (sort { length($a) <=> length($b) } @regexs) {
          print "Building regex for $regex\n";
          push @compregex, qr/$regex/;
      }

      open FILENAME, $filename;

      # could rewrite this with map, used while and foreach cause
      # most people know how they work, some dont understand map =)
      while (<FILENAME>) {
          chomp;
          foreach my $test (@compregex) {
              ($match{"$_"})++ if (/$test/);
              last if $match{"$_"};    # found a match, no need to go further
          }
          ($nomatch{"$_"})++ unless $match{"$_"};
      }

      print "Matches:\n";
      foreach my $X (keys %match) {
          print "$X\n";
      }
      print "no Matches:\n";
      foreach my $X (keys %nomatch) {
          print "$X\n";
      }


      -Waswas-fng
Re: Efficiency: Foreach loop and multiple regexs
by Dinosaur (Beadle) on Sep 13, 2002 at 18:40 UTC

    Just to clarify before I start: What you want is to find the subset of @array elements which match any regex -- right?

    Regex::PreSuf suggested above looks interesting, if your regexes really are just a list of words to match.

    Otherwise, no matter which way you nest the loops, you can expect to do (140*20000)/2 regex matches on average (/2 because you get out on the first match). You can optimise the loop (e.g., with study, suggested above), but the real payoff has to come from running it far fewer times.

    To do that, you might think about ordering the regex list. (Here I'm assuming that the regex list is constant and the data varies from run to run). If you know something about the structure of the incoming data, you should be able to guess with some accuracy which regexes are most likely to match the most data entries. Put those at the front of the list. Depending on how much trouble you're willing to go to over this, you might even conduct some experiments on the data to find the best regexes.

    Also consider ranking the regexes fastest-first, a determination you can probably get pretty close to by eyeball.
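    One way to get that ordering empirically; the sample data, patterns, and hit counts below are invented for illustration:

```perl
# Count how often each pattern matches a sample of the data, then put
# the most frequently matching patterns first.
my @sample = ('aa', 'ab', 'bb', 'ba', 'aa');   # invented sample
my @plain  = ('b', 'a', 'zz');                 # invented patterns

my %hits;
for my $re (@plain) {
    $hits{$re} = grep { /$re/ } @sample;   # grep in scalar context counts
}
my @ordered = sort { $hits{$b} <=> $hits{$a} } @plain;
# 'a' hits 4 of 5 samples, 'b' hits 3, 'zz' none
```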

    --Dinosaur

Re: Efficiency: Foreach loop and multiple regexs
by Aristotle (Chancellor) on Sep 14, 2002 at 03:46 UTC
    broquaint's method is the most common one. You could also use eval to chain greps together.
    my @grep  = map "grep /$_/,", qw/ regex1 regex2 . . regex 140 /;
    my @match = eval join ' ', @grep, '@array';
    For maximum efficiency make sure to sort the regex array by descending likelihood of a match.

    Makeshifts last the longest.