neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

Brothers and Sisters,

Suppose I currently have something like this:

my @regex = map { qr/$_/ } qw/ regex1 regex2 . . regex 140 /;

# @array has 20000 entries
foreach $i (@array) {
    for $x (@regex) {
        push @match, $x if $i =~ m/$x/;
    }
}

There must be a more efficient way to do this. I was thinking it would be better to loop through @regex first:

foreach $x (@regex) {
    @bad_matches = grep ! m/$x/, @array;
}

After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

Am I on the right track? Would using the Parallel::ForkManager somehow help? Does anyone have other suggestions?

Neil Watson
watson-wilson.ca

Replies are listed 'Best First'.
Re: Efficiency: Foreach loop and multiple regexs
by broquaint (Abbot) on Sep 13, 2002 at 14:33 UTC
    A better option might be to join the regex array together as a single regex made up of alternations
    my $regex = do {
        local $" = '|';
        qr/(?:@regexes)/;
    };
    my @matches = grep /$regex/, @array;
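    With invented data, the combined pattern behaves like this (the patterns and strings below are made up purely for illustration):

```perl
my @regexes = ('foo', 'ba.');                 # invented patterns
my @array   = ('foobar', 'bazaar', 'quux');   # invented data

# Join the patterns into one alternation: qr/(?:foo|ba.)/
my $regex = do { local $" = '|'; qr/(?:@regexes)/ };
my @matches = grep /$regex/, @array;
# 'foobar' and 'bazaar' match; 'quux' does not
```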

    HTH

    _________
    broquaint

    update: changed code re neilwatson's clarification

      Or use Regex::PreSuf, which takes such an array and builds an optimised regex from it.
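      Regex::PreSuf is not assumed installed here; as a sketch of the idea, the hand-written pattern below factors out the common prefix the way its presuf() would:

```perl
my @words    = qw(foobar foobaz foonly);
my $naive    = qr/^(?:foobar|foobaz|foonly)$/;
# Roughly the kind of prefix-factored pattern presuf() builds:
my $factored = qr/^foo(?:ba[rz]|nly)$/;

# Both patterns accept exactly the same words; the factored one lets
# the engine share the work of matching the common "foo" prefix.
my @naive_hits    = grep { /$naive/ }    (@words, 'foogle');
my @factored_hits = grep { /$factored/ } (@words, 'foogle');
```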
      --
      Mike
Re: Efficiency: Foreach loop and multiple regexs
by bronto (Priest) on Sep 13, 2002 at 14:51 UTC

    in Advanced Perl Programming it is suggested to create the matching loop on the fly with an eval. That is, if @plain contains the string version of your @regex, then it is suggested to do something like this:

    # UNTESTED CODE BELOW THIS LINE #################
    foreach my $re (@plain) {
        my $cycle = q(foreach my $i (@array) { push @match, $re if $i =~ /)
                  . $re
                  . q(/ });
        eval $cycle;
    }

    This way, each "matching" loop is compiled as soon as it is needed, and you gain some speed because the regexp in the pattern match is not re-interpolated on every iteration.

    Of course, you may not be keen on using eval $scalar, but that's a matter of taste (for many definitions of taste :-)

    Ciao!
    --bronto

    # Another Perl edition of a song:
    # The End, by The Beatles
    END {
      $you->take($love) eq $you->made($love) ;
    }

Re: Efficiency: Foreach loop and multiple regexs
by Zaxo (Archbishop) on Sep 13, 2002 at 15:05 UTC

    Your second proposal is quite different from your first; the results will differ.

    The first will probably benefit from study $i; before the inner loop. With 140 matches to try, it is likely to pay off (but Benchmark it).

    Consider these questions. Do you want all matches, or just the first match? That means the order of regexen matters. Do you need to retain any information about which regex matched a given @array element? That would be the case if this is a parser of some kind. Do you need to maintain the sequence of @array elements?

    Devise a data structure which makes it easy to store and recover what you need to know. Depending on the kind of data in @array, you may want to push indexes or references to elements rather than copying. That will assist in remembering where each element came from, too.

    If the first match is the important one, last will bail out of the inner loop, saving unwanted tests.
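    A minimal sketch of that first-match loop, with invented data (study is harmless here; on modern perls it is a no-op):

```perl
my @array = ('alpha one', 'beta two', 'gamma three');  # invented data
my @regex = (qr/beta/, qr/alpha/, qr/zzz/);            # invented patterns
my @match;

foreach my $i (@array) {
    study $i;                  # hint that $i will be matched repeatedly
    foreach my $x (@regex) {
        if ($i =~ $x) {
            push @match, $i;
            last;              # first match wins; skip the remaining regexes
        }
    }
}
# 'gamma three' matches nothing and is dropped
```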

    After Compline,
    Zaxo

Re: Efficiency: Foreach loop and multiple regexs
by blokhead (Monsignor) on Sep 13, 2002 at 14:39 UTC
    After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

    Are you trying to match only the items in @array that match all of your regexes? If so, then I would probably write the loops with foreach @regex on the outside (since @regex is much smaller). Please correct me if I misinterpreted your question.

    foreach $reg (@regex) {
        my $i = 0;
        while ($i <= $#array) {
            if ($array[$i] =~ /$reg/) {
                # successful, so keep $array[$i] in the array and move to the next
                $i++;
            }
            else {
                # this item didn't pass, so we remove it for good and
                # never have to test it again
                splice(@array, $i, 1);
            }
        }
    }
    Something like this might improve your efficiency -- as soon as an item fails a test, it is removed from @array, so we have less and less to loop through each time. Ideally you would want to organize @regex so that the most restrictive expressions (those that fail on the highest number of items in @array) come first -- that would give you the best efficiency.

    Let me know if this is not what you intended for this code snippet.

    blokhead

      I should have been more clear. I'm trying to match ANY item in @array that matches any regex in @regex.

      Neil Watson
      watson-wilson.ca

        OK, in that case, we can kinda swap logic in the if-statement.
        my @matches;
        foreach $reg (@regex) {
            my $i = 0;
            while ($i <= $#array) {
                if ($array[$i] =~ /$reg/) {
                    # successful, so this was a match! put it somewhere safe;
                    # also remove it from @array, we don't need to check it again
                    push @matches, $array[$i];
                    splice(@array, $i, 1);
                }
                else {
                    # this item didn't pass, so keep it and try again next regex
                    $i++;
                }
            }
        }
        Take successful matches, put them somewhere, and remove them since we don't need to ever check them again. In this scenario, you would want to put the regex which accepts the highest number of items first in @regex, so that more is removed from @array earlier.

        The replies from other monks also are good advice -- combining into one regex is a great idea as well.

        blokhead

Re: Efficiency: Foreach loop and multiple regexs
by jkahn (Friar) on Sep 13, 2002 at 15:50 UTC
    neilwatson said:
    foreach $x (@regex) {
        @bad_matches = grep ! m/$x/, @array;
    }
    After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?

    I would suggest that you use the following mechanism (incorporating some ideas from other contributors):

    foreach my $regex (@regexes) {
        @array = grep { ! m/$regex/ } @array;
    }

    If you really need to keep @bad_matches (those that *did* match the regexes) then you probably want to run the loop manually, as somebody else suggested:

    my (@array);    # is populated however you originally did
    my (@keepers);
    my (@bad);

    ITEM:           # label added
    while (@array) {
        my $item = shift @array;
        foreach my $regex (@regexes) {
            if ($item =~ $regex) {
                push @bad, $item;
                next ITEM;  # return to label
            }
        }
        push @keepers, $item;
    }

    # if you want it back on @array, just uncomment:
    # @array = @keepers;

    HTH, -- jkahn

    Update:(09:37 PDT) added ITEM: label and next ITEM line. Whoops. Code will be substantially sub-optimal without them.

      I like this, it is pretty fast and yet readable:
      #!/usr/local/bin/perl
      my @regexs = ('test12','test2','test3','test4');
      my $filename = '/var/tmp/testdata';
      my @compregex = ();
      my (%match, %nomatch) = ();

      # Pre-compile regexes, sorting smaller to larger
      # (shorter prolly will match more than a longer =)
      foreach my $regex (sort { length($a) <=> length($b) } @regexs) {
          print "Building regex for $regex\n";
          push @compregex, qr/$regex/;
      }

      open FILENAME, $filename;

      # could rewrite this with map, used while and foreach cause
      # most people know how they work, some dont understand map =)
      while (<FILENAME>) {
          chomp;
          foreach my $test (@compregex) {
              ($match{"$_"})++ if (/$test/);
              last if $match{"$_"};    # found a match, no need to go further
          }
          ($nomatch{"$_"})++ unless $match{"$_"};
      }

      print "Matches:\n";
      foreach my $X (keys %match) {
          print "$X\n";
      }
      print "no Matches:\n";
      foreach my $X (keys %nomatch) {
          print "$X\n";
      }


      -Waswas-fng
Re: Efficiency: Foreach loop and multiple regexs
by Dinosaur (Beadle) on Sep 13, 2002 at 18:40 UTC

    Just to clarify before I start: What you want is to find the subset of @array elements which match any regex -- right?

    Regex::PreSuf suggested above looks interesting, if your regexes really are just a list of words to match.

    Otherwise, no matter which way you nest the loops, you can expect to do (140*20000)/2 regex matches on average (/2 because you get out on the first match). You can optimise the loop (e.g., with study, suggested above), but the real payoff has to come from running it far fewer times.

    To do that, you might think about ordering the regex list. (Here I'm assuming that the regex list is constant and the data varies from run to run). If you know something about the structure of the incoming data, you should be able to guess with some accuracy which regexes are most likely to match the most data entries. Put those at the front of the list. Depending on how much trouble you're willing to go to over this, you might even conduct some experiments on the data to find the best regexes.

    Also consider ranking the regexes fastest-first, a determination you can probably get pretty close to by eyeball.
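    One way to get that ordering empirically; the sample data, patterns, and hit counts below are invented for illustration:

```perl
# Count how often each pattern matches a sample of the data, then put
# the most frequently matching patterns first.
my @sample = ('aa', 'ab', 'bb', 'ba', 'aa');   # invented sample
my @plain  = ('b', 'a', 'zz');                 # invented patterns

my %hits;
for my $re (@plain) {
    $hits{$re} = grep { /$re/ } @sample;   # grep in scalar context counts
}
my @ordered = sort { $hits{$b} <=> $hits{$a} } @plain;
# 'a' hits 4 of 5 samples, 'b' hits 3, 'zz' none
```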

    --Dinosaur

Re: Efficiency: Foreach loop and multiple regexs
by Aristotle (Chancellor) on Sep 14, 2002 at 03:46 UTC
    broquaint's method is the most common one. You could also use eval to chain greps together.
    my @grep  = map "grep /$_/,", qw/ regex1 regex2 . . regex 140 /;
    my @match = eval join ' ', @grep, '@array';
    For maximum efficiency make sure to sort the regex array by descending likelihood of a match.

    Makeshifts last the longest.