Re: Efficiency: Foreach loop and multiple regexs
by broquaint (Abbot) on Sep 13, 2002 at 14:33 UTC
|
A better option might be to join the regex array together as a single regex made up of alternations
my $regex = do {
local $" = '|';
qr/(?:@regexes)/;
};
my @matches = grep /$regex/, @array;
HTH
_________ broquaint
update: changed code re neilwatson's clarification | [reply] [d/l] |
|
|
Or use Regex::PreSuf, which takes such an array and builds a more optimal regex from it.
--
Mike
| [reply] |
Re: Efficiency: Foreach loop and multiple regexs
by bronto (Priest) on Sep 13, 2002 at 14:51 UTC
|
in Advanced Perl Programming it is suggested to create the matching loop on the fly with an eval. That is, if @plain contains the string version of your @regex, then it is suggested to do something like this:
# UNTESTED CODE BELOW THIS LINE #################
foreach my $re (@plain) {
my $cycle = q(
foreach my $i (@array) {
push @match,$re if $i =~ /).$re.q(/
}) ;
eval $code ;
}
This way, every "matching" cycle gets compiled and used as soon as it is needed, and you have optimization since the regexp in the pattern matching is not interpolated at every cycle
Of course, you could be not so keen on using eval $scalar, but it's a matter of taste (for many definitions of taste :-)
Ciao! --bronto
# Another Perl edition of a song:
# The End, by The Beatles
END {
$you->take($love) eq $you->made($love) ;
}
| [reply] [d/l] [select] |
Re: Efficiency: Foreach loop and multiple regexs
by Zaxo (Archbishop) on Sep 13, 2002 at 15:05 UTC
|
Your second proposal is not very like your first. The results will differ.
The first will probably benefit from study $i; before the inner loop. With 140 matches to try, it is likely to pay off ( but Benchmark).
Consider these questions. Do you want all matches, or just the first match? That means the order of regexen matters. Do you need to retain any information about which regex matched a given @array element? That would be the case if this is a parser of some kind. Do you need to maintain the sequence of @array elements?
Devise a data structure which makes it easy to store and recover what you need to know. Depending on the kind of data in @array, you may want to push indexes or
references to elements rather than copying. That will assist in remembering where each element came from, too.
If the first match is the important one, last will bail out of the inner loop, saving unwanted tests.
After Compline, Zaxo
| [reply] |
Re: Efficiency: Foreach loop and multiple regexs
by blokhead (Monsignor) on Sep 13, 2002 at 14:39 UTC
|
After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?
Are you trying to match only the items in @array that match all of your regexes? If so, then I would probably write the loops with the outer one as foreach @regex (since @regex is much smaller) .. Please correct me if I misinterpreted your question.
foreach $reg (@regex) {
my $i = 0;
while ($i <= $#array) {
if ($array[$i] =~ /$x/) {
# successful, so keep $array[$i] in array and move to the
+next
$i++;
} else {
# this item didn't pass so we remove it for good and
# never have to test it again.
splice(@array, $i, 1);
}
}
}
Something like this might improve your efficiency -- as soon as an item fails a test, it is removed from @array, so we have less and less to loop through each time. Ideally you would want to organize @regex so that the most restrictive expressions (those that fail on the highest number of items in @array) come first -- that would give you the best efficiency.
Let me know if this is not what you intended for this code snippet.
blokhead | [reply] [d/l] [select] |
|
|
| [reply] |
|
|
OK, in that case, we can kinda swap logic in the if-statement.
my @matches;
foreach $reg (@regex) {
my $i = 0;
while ($i <= $#array) {
if ($array[$i] =~ /$x/) {
# successful, so this was a match! put it somewhere safe
# also remove it from @array, we don't need to check it ag
+ain
push @matches, $array[$i];
splice(@array, $i, 1);
} else {
# this item didn't pass so keep it and try again next rege
+x
$i++;
}
}
}
Take successful matches, put them somewhere, and remove them since we don't need to ever check them again. In this scenario, you would want to put the regex which accepts the highest number of items first in @regex, so that more is removed from @array earlier.
The replies from other monks also are good advice -- combining into one regex is a great idea as well.
blokhead | [reply] [d/l] |
|
|
Re: Efficiency: Foreach loop and multiple regexs
by jkahn (Friar) on Sep 13, 2002 at 15:50 UTC
|
foreach $x (@regex){
@bad_matches = grep ! m/$x/, @array;
}
After the grep how would I remove the entries in @bad_matches from @array before the next $x? Would this be faster?
I would suggest that you use the following mechanism (incorporating some ideas from other contributors): foreach my $regex (@regexes) {
@array = grep { ! m/$regex/ } @array;
}
If you really need to keep @bad_matches (those that *did* match the regexes) then you probably want to run the loop manually, as somebody else suggested: my (@array); # is populated however you originally did
my (@keepers);
my (@bad);
ITEM: # label added
while (@array) {
my $item = shift @array;
foreach my $regex (@regexes) {
if ($item =~ $regex) {
push @bad, $item;
next ITEM; # return to label
}
}
push @keepers, $item;
}
# if you want it back on @array, just uncomment:
# @array = @keepers;
HTH, -- jkahn
Update:(09:37 PDT) added ITEM: label and next ITEM line. Whoops. Code will be substantially sub-optimal without them. | [reply] [d/l] [select] |
|
|
I like this, it is pretty fast and yet readable:
#!/usr/local/bin/perl
my @regexs = ('test12','test2','test3','test4');
my $filename = '/var/tmp/testdata';
my @compregex = ();
my (%match, %nomatch) = ();
# Pre-compile regexes, sorting smaller to larger
# (shorter prolly will match more than a longer =)
foreach my $regex (sort { length($a) <=> length($b) } @regexs) {
print "Building regex for $regex\n";
push @compregex, qr/$regex/;
}
open FILENAME, $filename;
# could rewrite this with map, used while and foreach cause
# most people know how they work, some dont understand map =)
while (<FILENAME>) {
chomp;
foreach my $test (@compregex) {
($match{"$_"})++ if (/$test/) ;
last if $match{"$_"}; #found a match no need t
+o go further
}
($nomatch{"$_"})++ unless $match{"$_"};
}
print "Matches:\n";
foreach my $X (keys %match) {
print "$X\n";
}
print "no Matches:\n";
foreach my $X (keys %nomatch) {
print "$X\n";
}
-Waswas-fng | [reply] [d/l] |
Re: Efficiency: Foreach loop and multiple regexs
by Dinosaur (Beadle) on Sep 13, 2002 at 18:40 UTC
|
Just to clarify before I start: What you want is to find
the subset of @array elements which match any regex -- right?
Regex::PreSuf suggested above looks interesting, if your
regexes really are just a list of words to match.
Otherwise, no matter which way you nest the loops,
you can expect to have to do (140*20000)/2 regex matches
on the average (/2 because you get out on the first match).
You can optimise the loop (e.g., with study suggested above), but the payoff has to be in running it
a lot less times.
To do that, you might think about ordering the regex
list. (Here I'm assuming that the regex list is constant and the data varies from run to run). If you know something about the structure of the incoming data, you should be able to guess with some accuracy which regexes are most likely to match the most data entries. Put those at the front of the list. Depending on how much trouble you're willing to go to over this, you might even conduct some experiments on the data to find the best regexes.
Also consider ranking the regexes fastest-first, a determination you can probably get pretty close to by eyeball.
--Dinosaur
| [reply] [d/l] |
Re: Efficiency: Foreach loop and multiple regexs
by Aristotle (Chancellor) on Sep 14, 2002 at 03:46 UTC
|
broquaint's method is the most common one. You could also use eval to chain greps together.
my @grep = map "grep /$_/,", qw/
regex1
regex2
.
.
regex 140
/;
my @match = eval join ' ', @grep, '@array';
For maximum efficiency make sure to sort the regex array by descending likelihood of a match.
Makeshifts last the longest. | [reply] [d/l] |