I agree with the comment that benchmarking is very important. The regex engine has changed recently, and some "old conventional wisdom" may no longer hold. Performance generally depends on the release (and later releases are not always faster), and speed depends on the exact situation.

But from previous work that I've done, if you have a bunch of terms that are Or'd together, X|Y, the regex engine will handle them more efficiently if it can see them all at once than if you run separate regexes, X and then Y.
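Since results vary by release and by data, it is worth measuring this claim rather than trusting it. Here is a minimal sketch using the core Benchmark module; the term list, the sample lines, and the sub names are made up for illustration, so substitute your own data before drawing conclusions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: made up for illustration only.
my @terms = qw(apple banana cherry grape lemon);
my @lines = map { "line $_ has " . $terms[ $_ % @terms ] . " in it" } 1 .. 1000;
push @lines, map { "line $_ has nothing of interest" } 1 .. 1000;

# One combined alternation: the engine sees every term at once.
my $alternation = join '|', map { quotemeta($_) } @terms;
my $combined    = qr/$alternation/;

# Separate precompiled regexes, tried one after another.
my @separate = map { qr/\Q$_\E/ } @terms;

sub count_combined {
    return scalar grep { /$combined/ } @lines;
}

sub count_separate {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@separate) {
            if ( $line =~ $re ) { $hits++; next LINE }
        }
    }
    return $hits;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { combined => \&count_combined, separate => \&count_separate } );
```

Both subs count the same lines, so any difference in the table is down to how the matching is done. The numbers will differ between Perl versions and data sets, which is the whole point of running it yourself.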

One module that I have works on "regex piece parts". Each small bit is tested separately, then Perl builds a humongous regex with all of them Or'd together. That regex gets dynamically compiled and used. For development, I can work on one of the pieces and regression test it before getting the rest of the regex zoo involved.
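To give a rough idea of the "piece parts" approach, here is a small sketch. It is not my actual module: the piece names and patterns are hypothetical stand-ins chosen just to show the mechanics.

```perl
use strict;
use warnings;

# Hypothetical piece parts: each is a small regex that can be tested alone.
my %piece = (
    iso_date => qr/\d{4}-\d{2}-\d{2}/,
    time_hm  => qr/\d{1,2}:\d{2}/,
    hex_word => qr/0x[0-9A-Fa-f]+/,
);

# Regression-test one piece in isolation before it joins the others.
die "iso_date piece is broken" unless '2024-01-31' =~ /\A$piece{iso_date}\z/;

# Build the big regex at run time by Or'ing every piece together.
my $alternation = join '|', map { "(?:$piece{$_})" } sort keys %piece;
my $big         = qr/\b(?:$alternation)\b/;

my $text  = 'Logged 0x1F at 9:05 on 2024-01-31.';
my @found = $text =~ /($big)/g;
print "$_\n" for @found;
```

Each piece is wrapped in its own non-capturing group so that an alternation inside one piece cannot leak into its neighbours.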

Perl's ability to dynamically create a regex and use it is more convenient than what C#, Java, etc. offer: those languages can also compile a pattern from a string at run time, but without Perl's native interpolation and qr// objects. Sometimes this can work out very well. I have one piece of code that uses substr, some regex stuff, and some program logic to write simple, somewhat overlapping Or terms to search for specific things. This has helped me in some situations where I'm trying to match "sort of like" XYZ.

Anyway, consider the possibility of program-generated dynamic regexes. As Larry Wall says, "programs that write programs are the happiest programs of all".

Update:
I didn't give a clear-cut example of a dynamic regex, so here's one that is close to a real-world situation (it's a big simplification of actual code). Let's say that I am trying to find the word ABCD, but according to the matching rules, I am going to allow one of the letters to be wrong; for example, AXCD matches. Now let's say that, furthermore, I will allow a single pair of letters to be transposed (this counts as one combined error).

It is easy to algorithmically generate the combos: ABCD .BCD A.CD AB.D ABC. BACD ACBD ...etc. If I use a program to generate this long sequence of Or'd terms, then when the first letter is not an A, the regex engine will immediately rule out ABCD A.CD AB.D... etc. The regex engine builds a state machine that is pretty sophisticated, and it will execute quickly even if there are 30 terms in the "dumb" regex.

If somebody here knows how to write a general regex that runs as quickly, or even how to do it at all with one general regex, I'd like to hear about it! The regex should be able to look for words with 3, 4, 5, or 6 letters. My regex kung fu is not up to that job.
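The generation step can be sketched roughly like this. It is not the actual code: the sub names fuzzy_terms and fuzzy_regex are hypothetical, only adjacent pairs are transposed, and I make no promise about how well the engine optimizes the result, so benchmark it on your own data.

```perl
use strict;
use warnings;

# Hypothetical helper: list the Or terms for one word. Covers the exact word,
# one wrong letter at any position, and one transposed adjacent pair.
sub fuzzy_terms {
    my ($word) = @_;
    my @chars = split //, $word;
    my @terms = ( quotemeta($word) );    # exact match first

    # One wrong letter: replace each position in turn with ".".
    for my $i ( 0 .. $#chars ) {
        my @c = map { quotemeta($_) } @chars;
        $c[$i] = '.';
        push @terms, join '', @c;
    }

    # One transposed adjacent pair, e.g. BACD, ACBD, ABDC.
    for my $i ( 0 .. $#chars - 1 ) {
        my @c = @chars;
        @c[ $i, $i + 1 ] = @c[ $i + 1, $i ];
        push @terms, quotemeta( join '', @c );
    }

    my %seen;
    return grep { !$seen{$_}++ } @terms;    # drop duplicates (repeated letters)
}

# Hypothetical helper: compile the terms into one word-bounded regex.
sub fuzzy_regex {
    my ($word) = @_;
    my $alternation = join '|', fuzzy_terms($word);
    return qr/\b(?:$alternation)\b/;
}

my $re = fuzzy_regex('ABCD');
for my $candidate (qw(ABCD AXCD BACD ABDC XXCD DCBA)) {
    printf "%-5s %s\n", $candidate, ( $candidate =~ $re ? 'match' : 'no match' );
}
```

The same subs work for words of 3, 4, 5, or 6 letters, since the loops run over whatever length the word has. Note that "." accepts any character, not just a letter; tighten it to a character class if that matters for your data.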

In reply to Re: Multiple Regex evaluations or one big one? by Marshall
in thread Multiple Regex evaluations or one big one? by flyerhawk
