Random_Walk has asked for the wisdom of the Perl Monks concerning the following question:
I have a bunch of regex lines (about 700) which I need to test against a logfile. I was provided the lines by a third party who at first just said they are regex. I have coded a nice fast Perl solution that builds a block of code to eval, which prevents the regex being recompiled every time, and it is good. Now the other party has informed me the regex are egrep. I have searched the web, Google and Super Search for differences between the two and find that egrep uses a DFA engine while Perl uses an NFA. Perl provides back refs and capturing, egrep does not.
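For anyone following along, here is a minimal sketch of the compile-once idea. It is not the code described above: it swaps the eval'd code block for precompiled qr// objects, and the two sample patterns and the sub name matching_ids are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample patterns; the real list would come from the pattern file.
my @patterns = (
    'disk .* full',
    'link (up|down) on port [0-9]+',
);

# Compile each pattern exactly once and remember its 1-based position in the list.
my @compiled = map { +{ id => $_ + 1, re => qr/$patterns[$_]/ } } 0 .. $#patterns;

# Return the ids of every pattern that matches one log line.
sub matching_ids {
    my ($line) = @_;
    return map { $_->{id} } grep { $line =~ $_->{re} } @compiled;
}

my @hits = matching_ids('link down on port 12');
print "matched pattern(s): @hits\n";
```

Each log line is then tested against the already compiled objects, so nothing is recompiled inside the loop.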
My questions are
Sorry, but this is a sanitized version; the real stuff is the intellectual property of another company.
rather hot at on ([^C]|C[^P]|CP[^U])
I really don't want to have to re-write my code, and I dread the thought of 700 system calls to egrep!! If I have to, is there any efficient way I can call egrep 700 times, or should I re-write in a DFA regex language (awk?)?
The line I gave as looking for CPU overheats is of course looking for anything BUT CPU overheats, thanks ysth
Thanks to all for the input. TedPride's lists were a good start, and ambrus got me thinking along the lines of things Perl does which egrep does not being a problem; ysth provided some more examples. Ovid got to the root of the problem: poorly provided specs (here is some regex. what sort? oh just <shrug> regex), sigh. Happy-the-monk commented on the performance hit of forking many egreps, recommending I stay in a single Perl thread to do all my matches.
What I am going to do is combine a few tests for literals in egrep that are special in perl into part of the procedure for installing new pattern files
Damn, I just found three occurrences of [ \t] in the patterns, time for s/\\([a-tA-T])/\\\\$1/. Any suggestions for more substitutions?
perl -ne 'print "prob with $_" if /(\\[a-tA-T])|other tests/'
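A rough sketch of a narrower version of that substitution, on the assumption that the patterns follow POSIX bracket-expression rules, where a backslash inside [...] is an ordinary character. It doubles backslashes only inside bracket expressions and leaves escapes outside them alone. The sub name escape_bracket_backslashes is made up, and this is a simple scanner, not a full ERE parser.

```perl
use strict;
use warnings;

# Double every backslash found inside a bracket expression, so Perl reads it
# as a literal backslash (assumed POSIX egrep behaviour) instead of an escape.
sub escape_bracket_backslashes {
    my ($ere) = @_;
    my $out = '';
    pos($ere) = 0;
    while (pos($ere) < length $ere) {
        if ($ere =~ /\G(\\.)/gcs) {
            # An escaped character outside brackets, e.g. \[ or \. : copy as-is.
            $out .= $1;
        }
        elsif ($ere =~ /\G(\[\^?\]?(?:\[:\w+:\]|[^\]])*\])/gc) {
            # A whole bracket expression, allowing a leading ^ or ] and [:class:].
            (my $set = $1) =~ s/\\/\\\\/g;
            $out .= $set;
        }
        elsif ($ere =~ /\G(.)/gcs) {
            # Any other single character.
            $out .= $1;
        }
    }
    return $out;
}

my $fixed = escape_bracket_backslashes('x[ \t]y');
print "$fixed\n";
```

With the converted pattern, Perl's class contains space, backslash and the letter t, rather than space and tab.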
Looking at Mastering Regular Expressions, chap 5, I find this little gem too... (bear in mind here that Perl regex is a traditional NFA while egrep is a DFA)
What text will actually be matched by tour|to|tournament when applied to the string `three·tournaments·won'? All the alternatives are attempted (and fail) during each attempt (at the 1st character position, 2nd, 3rd, and so on) until the transmission starts the attempt at `three·|tournaments·won'. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.
So, we see that alternation is not greedy, at least not for an NFA. Well, to be specific, alternation is not greedy for a Traditional NFA. Greedy alternation would have matched the longest possible alternative (tournament), wherever in the list it happened to be. A POSIX NFA, or any DFA would have indeed done just that, but I'm getting a bit ahead of myself.
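The quoted behaviour is easy to check in Perl itself. A small sketch, with plain spaces standing in for the book's · markers:

```perl
use strict;
use warnings;

my $text = 'three tournaments won';

# Perl tries the alternatives left to right and keeps the first that succeeds.
my ($first) = $text =~ /(tour|to|tournament)/;

# Listing the longest alternative first gives the longest-match result here.
my ($longest) = $text =~ /(tournament|tour|to)/;

print "as written:    $first\n";    # tour
print "longest first: $longest\n";  # tournament
```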
So I think I need another little gem to find all alternations and sort them by size of literal so the longest possible match is returned. Sounds like a task full of pitfalls. Any ideas how I can efficiently shell out 700 egreps a few times a second? I thought perhaps I could generate a shell script with Perl containing all the egreps and returning the line number of the one it matched, then just fork once to this shell, but of course the shell still forks a child for each egrep, no win there :-(
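A minimal sketch of the sorting idea, limited on purpose to the easy case: a pattern that is nothing but literal words separated by |. Anything with groups, classes or quantifiers is returned untouched, since that is where the pitfalls are. The sub name longest_first is made up for illustration.

```perl
use strict;
use warnings;

# Reorder a purely literal alternation so the longest alternative comes first.
# Patterns containing anything other than word characters and '|' are left alone.
sub longest_first {
    my ($pattern) = @_;
    return $pattern unless $pattern =~ /\A\w+(?:\|\w+)+\z/;
    return join '|',
        sort { length($b) <=> length($a) || $a cmp $b }
        split /\|/, $pattern;
}

my $sorted = longest_first('tour|to|tournament');
print "$sorted\n";
```

The order of alternatives changes which text gets matched, not whether a line matches at all, because a backtracking engine tries every alternative before giving up. So the reordering only matters if the matched text itself is used afterwards.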
Cheers,
R.