I have a bunch of regex lines (about 700) which I need to test against a logfile. I was provided the lines by a third party who at first just said they are regex. I have coded a nice fast Perl solution that builds a block of code to eval, which prevents the regexes being recompiled every time, and it is good. Now the other party has informed me the regexes are egrep. I have searched the web, Google and Super Search for differences between the two and find that egrep uses a DFA engine while Perl uses an NFA. Perl provides backreferences and capturing, egrep does not.
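For readers who want to picture the "compile once" idea, here is a minimal sketch. It is not the eval-based code described above; it swaps in qr// precompilation, and the two patterns and the name matching_patterns are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the ~700 third-party patterns.
my @pattern_text = (
    'rather hot at on ([^C]|C[^P]|CP[^U])',
    'disk [0-9]+ failed',
);

# Compile each pattern once with qr// and keep its 1-based number beside it.
my @compiled = map { [ $_ + 1, qr/$pattern_text[$_]/ ] } 0 .. $#pattern_text;

# Return the numbers of all patterns that match one log line.
sub matching_patterns {
    my ($line) = @_;
    return map { $_->[0] } grep { $line =~ $_->[1] } @compiled;
}

print join(',', matching_patterns('disk 12 failed')), "\n";
```

Each log line is then tested against objects that were compiled when the pattern list was loaded, so nothing is recompiled per line.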

My questions are

Is egrep regex a sub-set of perl regex ?
Anyone know a nice doc listing the differences ?
How much danger is there of random egrep regex causing the NFA engine to stick in a loop ?

Here is a snippet of one of the regexes that looked odd to me. It is looking for an overheat on a CPU:

rather hot at on ([^C]|C[^P]|CP[^U])

Sorry, but this is a sanitized version; the stuff is the intellectual property of another company.

I really don't want to have to rewrite my code, and I dread the thought of 700 system calls to egrep! If I have to, is there any efficient way I can call egrep 700 times, or should I rewrite in a DFA regex language (awk?)?

Update

The line I gave as looking for CPU overheats is of course looking for anything BUT CPU overheats. Thanks, ysth.
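A quick way to see that behaviour is to run the sanitized pattern against a few lines. The log lines below are made up for illustration; only the pattern comes from the post.

```perl
use strict;
use warnings;

# The sanitized pattern from the post, compiled once.
my $not_cpu = qr/rather hot at on ([^C]|C[^P]|CP[^U])/;

# Hypothetical log lines, invented for this demonstration.
my @lines = (
    'rather hot at on CPU 0',
    'rather hot at on FAN 2',
    'rather hot at on CPx',
);

for my $line (@lines) {
    printf "%-26s %s\n", $line, ($line =~ $not_cpu ? 'match' : 'no match');
}
```

Only the CPU line fails to match: each of the three alternatives needs a character that breaks the sequence C, P, U at some point.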

Thanks to all for the input. TedPride's lists were a good start, and ambrus got me thinking about the things Perl does which egrep does not as the likely problem; ysth provided some more examples. Ovid got to the root of the problem: poorly provided specs (here is some regex. what sort? oh, just <shrug> regex). Sigh. Happy-the-monk commented on the performance hit of forking many egreps, recommending I stay in a single perl thread to do all my matches.

What I am going to do is add a few tests for constructs that are literal in egrep but special in Perl to the procedure for installing new pattern files:

perl -ne 'print "prob with $_" if /(\\[a-tA-T])|other tests/'

Damn, I just found three occurrences of [ \t] in the patterns. Time for s/\\([a-tA-T])/\\\\$1/g. Any suggestions for more substitutions?
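The install-time check could grow into something like the sketch below. The sub name suspect_constructs and the two checks are assumptions chosen for illustration, not a complete list of egrep/Perl differences, and the check deliberately over-reports (an escaped backslash followed by a letter is flagged too).

```perl
use strict;
use warnings;

# Return a list of human-readable warnings for one egrep pattern.
sub suspect_constructs {
    my ($pattern) = @_;
    my @found;

    # Backslash + letter or digit: Perl reads these as escapes (\t, \d, \b, \1 ...).
    push @found, "backslash escape $1"
        while $pattern =~ /(\\[A-Za-z0-9])/g;

    # $name or @name would interpolate if the pattern is pasted into Perl source.
    push @found, "possible interpolation $1"
        while $pattern =~ /([\$\@][A-Za-z_]\w*)/g;

    return @found;
}

# Hypothetical pattern, invented for this demonstration.
print "$_\n" for suspect_constructs('user@host[ \t]+failed');
```

Anything reported can then be reviewed by hand before the pattern file goes live, rather than rewritten blindly.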

Update 2

Looking at Mastering Regular Expressions, chapter 5, I find this little gem too (bear in mind that Perl regex is a Traditional NFA while egrep is a DFA):

What text will actually be matched by tour|to|tournament when applied to the string `three·tournaments·won'? All the alternatives are attempted (and fail) during each attempt (at the 1st character position, 2nd, 3rd, and so on) until the transmission starts the attempt at `three·|tournaments·won'. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

So, we see that alternation is not greedy, at least not for an NFA. Well, to be specific, alternation is not greedy for a Traditional NFA. Greedy alternation would have matched the longest possible alternative (tournament), wherever in the list it happened to be. A POSIX NFA, or any DFA would have indeed done just that, but I'm getting a bit ahead of myself.

So I think I need another little gem to find all alternations and sort them by size of literal so the longest possible match is returned. Sounds like a task full of pitfalls. Any ideas how I can efficiently shell out 700 egreps a few times a second? I thought perhaps I could generate a shell script with Perl containing all the egreps and returning the line number of the one that matched, then just fork once to this shell. But of course the shell still forks a child for each egrep, so no win there :-(
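The book's example and the sorting idea can be tried in a few lines. This is a sketch that assumes the alternatives are plain literals and that the alternation is the whole regex; longest_first is a hypothetical helper name. Note that the ordering changes which text is captured, not whether the line matches at all.

```perl
use strict;
use warnings;

# Reorder plain literal alternatives so the longest one is tried first.
sub longest_first {
    my (@literals) = @_;
    return join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @literals;
}

my $text = 'three tournaments won';

# Perl takes the first alternative that succeeds at the leftmost position.
my ($as_written) = $text =~ /(tour|to|tournament)/;

# With the alternatives reordered, the longest literal wins instead.
my $sorted      = longest_first(qw(tour to tournament));
my ($reordered) = $text =~ /($sorted)/;

print "as written: $as_written\n";
print "reordered:  $reordered\n";
```

As written, Perl captures tour; with the reordered alternation it captures tournament, which is what the quoted passage says a DFA would return.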

Cheers,
R.

In reply to differnce between egrep and perl regex ? by Random_Walk
