Random_Walk has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of regex lines (about 700) which I need to test against a logfile, I was provided the lines by a third party who at first just said they are regex. I have coded a nice fast Perl solution that builds a block of code to eval so preventing the regex being recompiled every time and it is good. Now the other party has informed me the regex are egrep. I have searched the web, google and super search for diferences between the two and find egrep uses a DFA engine while perl uses an NFA. Perl provides back refs and capturing, egrep does not

My questions are

Here is a snippet of one of the regex that looked odd to me, it is looking for an overheat on a CPU
rather hot at on ([^C]|C[^P]|CP[^U])
Sorry but this is a sanitized version, the stuff if intellectual property of another company.

I really don't want to have to re-write my code and dread the thought of 700 system calls to egrep !! If I have to is there any efficient way I can call egrep 700 times or should I re-write in a DFA regex language (awk?)

Update

The line I gave as looking for CPU overheats is of course looking for anything BUT CPU overheats, thanks ysth

Thanks to all for the input, TedPride's lists were a good start and ambrus got me thinking along the lines of things perl does which egrep does not being a problem, ysth provided some more examples. Ovid got to the root of the problem, poorly provided specs (here is some regex. what sort ? oh just <shrug> regex) sigh. Happy-the-monk commented on the performance hit of forking many egreps recomending I stay in a single perl thread to do all my matches.

What I am going to do is combine a few tests for literals in egrep that are special in perl into part of the procedure for installing new pattern files

perl -ne 'print "prob with $_" if /(\\[a-tA-T])|other tests/'
Damn, I just found three occurances of [ \t] in the patterns, time for s/\\([a-tA-T])/\\\\\1/ Any suggestions for more substitutions ?

Update 2

Looking at mastering regular expressions, chap 5 if find this little gem too....(bear in mind here perl regex is a traditional NFA while egrep is a DFA)

What text will actually be matched by tour|to|tournament when applied to the string `three·tournaments·won'? All the alternatives are attempted (and fail) during each attempt (at the 1st character position, 2nd, 3rd, and so on) until the transmission starts the attempt at `three·|tournaments·won'. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

So, we see that alternation is not greedy, at least not for an NFA. Well, to be specific, alternation is not greedy for a Traditional NFA. Greedy alternation would have matched the longest possible alternative (tournament), wherever in the list it happened to be. A POSIX NFA, or any DFA would have indeed done just that, but I'm getting a bit ahead of myself.

So I think I need another little gem to find all alternations and sort them by size of literal so the longest possible match is returned. sounds like a task full of pitfalls. Any ideas how I can efficiently shell out 700 egreps a few times a second ? I thought perhaps I could generate a shell script with perl conatining all the egreps and returning the line number of the one it matched then just fork once to this shell but of course the shell still forks a child for each egrep, no win there :-(

Cheers,
R.

Replies are listed 'Best First'.
Re: differnce between egrep and perl regex ?
by TedPride (Priest) on Oct 12, 2004 at 10:59 UTC
    http://www.sorgonet.com/linux/regular-expressions/
    http://home.earthlink.net/~alanconnor/ed/regex.html

    As near as I can tell, Perl regex is egrep with slightly more features. For instance, Perl allows $1 $2 etc for matched values in parentheses, while egrep does not (though \1 \2 works in both? - see syntax). Perl also allows min/max number of matches with {4,5}, while egrep does not. Bottom line is that your egreps should hopefully work the same running through Perl regex, though Perl regex might not work running through egrep.

Re: differnce between egrep and perl regex ?
by ambrus (Abbot) on Oct 12, 2004 at 11:38 UTC

    The most important incompatibility between egrep regexps and perl regexps it the braces.

    GNU egrep uses \{N,K\} for repetitions, while GNU grep -E and perl uses {N,K}. This might be different in other versions of grep. (Update: nevyn is right, both perl and egrep uses {N,K}.)

    Also, backslashes are ordinary characters inside a bracketed character set, but are special in perl, so you must escape backslashes inside brackets when converting from egrep to perl.

    Update: also note that egrep is a DFA matcher (unless you use backreferences) so some regexps might have to be changed to work with perl (which is a NFA), especially some regexps that contain repetitions embedded in each other.

    Update: from perl 5.10.0 you can replace the regex engine with any other, including a posix one. The modules re::engine::TRE and re::engine::POSIX does this. You may need to use the x flag to get the egrep syntax instead of the grep syntax.

      GNU egrep uses \{N,K\} for repetitions, while GNU grep -E and perl uses {N,K}.

      GNU egrep and grep -E use the same setup code for the grep core. So I don't see how they can be different.

      --
      James Antill
Re: differnce between egrep and perl regex ?
by ysth (Canon) on Oct 12, 2004 at 12:15 UTC
    That odd-looking regex is matching "rather hot at on " followed by pretty much anything that isn't CPU, and that will work in perl as well (though /rather hot at on (?!CPU)/ would be a more natural way to do it in perl).

    Perl regexs are a superset of egrep, and the extensions perl offers use syntax that is illegal as an egrep regex, according to what I see about egrep in SUSv3. (The only possible exception is that, while SUSv3 does allow the {2}, {2,}, and {1,2} quantifiers, it doesn't specify that a ? afterward has any special meaning, as it does in perl.)

      Of course you are right about it being an anything but CPU match, I saw CPU in there and assumed it was looking in a very odd way for CPU rather than an egrep efficient match on (?!CPU).

      I guess my problem will be innocent litteral characters in the egrep regex that are not special there but that perl interprets as special, such as the {M,N}? where the ? is special in perl and \d, \w etc which egrep knows nothing about. I could just hope the odds are low.

      It looks like a non-trivial problem to be sure there is nothing perl regexp-ish that is not egrep regex, anyone know if there is a conversion module/tool already written, I only need do the conversion once everytime the patterns are updated (less than annual event)

      Cheers,
      R.

        You might just scan your regexes for /\\[a-z]/, but I really don't think in practice you will find anything that actually needs conversion.

        Update: Reading more closely, I see a few areas of minor concern for character classes; perl doesn't support collating symbols (e.g. [[.fu.]]) or equivalence class expressions (e.g. [[=ą=]]), and perl treats backslash in a character class as an escaping character but egrep should treat it as a literal backslash.

Re: differnce between egrep and perl regex ?
by Happy-the-monk (Canon) on Oct 12, 2004 at 11:12 UTC

    Alas, I can't answer your questions. =( TedPride came up with something useful; I have this vague thought to contribute:

    If the system calls  egrep,  it will spend resources on  forking which might well compete with the load that egrep would do on the machine. -- So if you're going to rewrite the process of spawning the egrep, consider rewriting it to use a single call to  perl  to do all the greping?

    Cheers, Sören

Re: differnce between egrep and perl regex ?
by Ovid (Cardinal) on Oct 12, 2004 at 13:48 UTC

    If possible, it's always best to fix the problem at the source. "Regular" expressions are far from regular. As you've already discovered, there are DFA and NFA engines. DFA engines are much faster, but can't match sections of the text to variables. Also, is this alledgedly a PCRE (Perl Compatible Regular Expression)? Is it from egrep or awk? There are many different regular expression engines out there. Someone telling you that "this is a regex" is the same thing as someone handing you code you don't recognize and saying "this is a programming language."

    Cheers,
    Ovid

    New address of my CGI Course.

      Ovid

      At first they told me "this is regex, we think its Perl", (I got the pattern files through our non-technical account manager). I asked for confirmation from the techies and I now get the reply (two months later, summer hols all round) that it is egrep regex. Of course we go live ASAP and I have a nice, Perl based solution developed and can think of little worse than a re-write to shell/egrep or perl-forking egreps.

      Cheers,
      R.

Re: differnce between egrep and perl regex ?
by maard (Pilgrim) on Oct 13, 2004 at 07:14 UTC
    Anyone know a nice doc listing the differences ?

    The best doc out there is Mastering Regular Expression by By J.Friedl, published by O'Reilly.

Re: differnce between egrep and perl regex ?
by maard (Pilgrim) on Oct 16, 2004 at 08:21 UTC
    So I think I need another little gem to find all alternations and sort them by size of literal so the longest possible match is returned. sounds like a task full of pitfalls. Any ideas how I can efficiently shell out 700 egreps a few times a second ? I thought perhaps I could generate a shell script with perl conatining all the egreps and returning the line number of the one it matched then just fork once to this shell but of course the shell still forks a child for each egrep, no win there :-(

    If you want speed then forking for each of 700 egreps isn't good, of course. Depending on complexity of your rexegps you can or cannot just use some DFA to NFA conversion tool. So probably manual conversion of all regexes worth it, especially if new regexps are added too seldom, as I understood from your post.
    As for longest match, just reorder alternations from longest to shortest. Also, in MRE there's an example of finding rightmost longest match, maybe you will find that useful.