in reply to perl performance vs egrep

Any suggestions to tune this?

Cool. I can offer a suggestion that uses a module I wrote. Regexp::Assemble is designed to optimise expressions like this. Consider the following:

use Regexp::Assemble;

my $r = Regexp::Assemble->new;
while ( <DATA> ) {
    chomp;
    $r->add( $_ );
}
print $r->as_string; # produces ^(?:K[LM]|P[AM]|S[LZ]|CP|ME|WX|YZ)XX1

__DATA__
^CPXX1
^KLXX1
^KMXX1
^MEXX1
^PAXX1
^PMXX1
^SLXX1
^SZXX1
^WXXX1
^YZXX1

You can also do this as a one liner:

print Regexp::Assemble ->new( chomp=> 1 ) ->add( <DATA> ) ->as_string;

The other thing you can do is take the assembled expression, remove the ?: to turn it into a POSIX-compatible expression, and feed that to egrep. (see Aristotle's words of wisdom below).
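As a rough sketch of that step, the non-capturing groups can be rewritten with a plain substitution. The variable names here are just for illustration, and the substitution is naive: it assumes the pattern contains no escaped parenthesis followed by a literal "?:", which holds for the expression above.

```perl
use strict;
use warnings;

# The assembled expression from above, held as a plain string.
my $assembled = '^(?:K[LM]|P[AM]|S[LZ]|CP|ME|WX|YZ)XX1';

# Turn every "(?:" into "(" so the pattern uses only POSIX ERE syntax.
( my $posix = $assembled ) =~ s/\(\?:/(/g;

print "$posix\n";    # prints ^(K[LM]|P[AM]|S[LZ]|CP|ME|WX|YZ)XX1
```

The result can then be passed to egrep as a quoted argument.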


update: hmm, I see that this produces the same thing that other posters have done by hand. Bear in mind that R::A is more at home when dealing with hundreds or thousands of discrete expressions: that is where it starts to shine. With 10 expressions it's like a car in first gear. Still, if your real-world environment is using more than you show here, you will see a definite improvement.

Another thing it's good at is when your expressions aren't so, um, regular. Consider what happens when PMXX1 is changed to PMXX2. (Hint: ^(?:(?:K[LM]|S[LZ]|CP|ME|WX|YZ)XX1|P(?:AXX1|MXX2))). That's not quite as easy to work out by hand.
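If you want to convince yourself that the assembled form really is equivalent, here is a small check that needs no modules. It compares the pattern quoted in the hint against a plain alternation of the ten words; the probe strings are my own picks for illustration, not an exhaustive test.

```perl
use strict;
use warnings;

# The ten words from the example, with PMXX1 changed to PMXX2.
my @words = qw(CPXX1 KLXX1 KMXX1 MEXX1 PAXX1 PMXX2 SLXX1 SZXX1 WXXX1 YZXX1);

# Plain left-anchored alternation built directly from the list.
my $alternation = join '|', map { quotemeta } @words;
my $plain       = qr/^(?:$alternation)/;

# The assembled form from the hint above, copied verbatim.
my $assembled = qr/^(?:(?:K[LM]|S[LZ]|CP|ME|WX|YZ)XX1|P(?:AXX1|MXX2))/;

# Every word, plus a few near misses that neither pattern should accept.
my @probes = ( @words, qw(PMXX1 PAXX2 KXXX1 XCPXX1 CPXX) );

my $disagreements = 0;
for my $probe (@probes) {
    my $hit_plain     = $probe =~ $plain     ? 1 : 0;
    my $hit_assembled = $probe =~ $assembled ? 1 : 0;
    $disagreements++ if $hit_plain != $hit_assembled;
    printf "%-7s plain=%d assembled=%d\n", $probe, $hit_plain, $hit_assembled;
}
print "disagreements: $disagreements\n";
```

Both patterns should agree on every probe, so the final line reports zero disagreements.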

- another intruder with the mooring in the heart of the Perl

Re^2: perl performance vs egrep
by Aristotle (Chancellor) on Jan 23, 2005 at 23:07 UTC

    The other thing you can do is take the assembled expression, remove the ?: to turn it into a POSIX-compatible expression, and feed that to egrep.

    Except that makes zero difference, because egrep uses a DFA engine (as opposed to Perl's NFA). To a DFA engine it doesn't matter which of any number of equivalent regexen you use. So long as they all match the exact same things, all of them will be translated to the exact same state machine.

    Makeshifts last the longest.

Re^2: perl performance vs egrep
by demerphq (Chancellor) on Jan 25, 2005 at 18:06 UTC

    I just thought I'd mention that, assuming you are talking about regexes of the form /^(LIST|OF|LITERALS)/ (i.e. no regex special characters involved, and left-anchored at the start of the string), once you get over a small handful of words (last time I checked it was around 50 or so) you can actually outperform Perl's regex engine with a pure-Perl trie. Perl really doesn't handle this type of pattern very well currently, and as a pet project I'm working on creating two new regops, TRIE and DFA, which will basically do all of this type of optimization automatically. (This will have the disadvantage of making regexes like your optimized one actually perform worse than a straight list of options.) Incidentally, you will probably find that if you omit the class logic the regex will run faster. As someone else mentioned already, classes disable a lot of optimizations that the engine can do.

    ---
    demerphq
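
For readers wondering what a pure-Perl trie for left-anchored literals might look like, here is a minimal sketch. It is not demerphq's code: the function names build_trie and starts_with_word are made up for illustration, the empty-string key as an end-of-word marker is just one possible convention, and no speed claim is made for this version, so benchmark it against the regex on your own data.

```perl
use strict;
use warnings;

# Build a trie as nested hashes, one level per character.
# The empty-string key marks a node where a complete word ends.
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $word;
        $node->{''} = 1;
    }
    return \%trie;
}

# Return 1 if $string begins with any word stored in the trie, else 0.
sub starts_with_word {
    my ( $trie, $string ) = @_;
    my $node = $trie;
    for my $char ( split //, $string ) {
        return 1 if $node->{''};    # a whole word has already been consumed
        $node = $node->{$char} or return 0;
    }
    return $node->{''} ? 1 : 0;
}

my $trie = build_trie(qw(CPXX1 KLXX1 KMXX1 MEXX1 PAXX1 PMXX1 SLXX1 SZXX1 WXXX1 YZXX1));

for my $line ( 'KLXX1 some record', 'KXXX1 other record', 'CPXX' ) {
    printf "%-20s %s\n", $line, starts_with_word( $trie, $line ) ? 'match' : 'no match';
}
```

Only the first sample line starts with a listed word, so it is the only one reported as a match.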