
Firstly, you are paying for the cost of the construction of the assembled pattern each time through the loop. In practice one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.

Nope. Then the benchmark would be flawed: constructing the pattern belongs inside the benchmarked code. But note that the assembled pattern is constructed outside of the grep; it is built once for each set of regexes to be matched, just as it happens in practice.
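To make that structure concrete, here is a minimal sketch of what I mean; it is not the benchmark I actually ran. Assumptions: the target strings are made up, the sub names count_separate and count_combined are hypothetical, and a plain join '|' alternation stands in for the assembled pattern so the code runs with core modules only. Don't expect it to reproduce the figures in this post.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Word list taken from this post; the targets are invented for illustration.
my @words   = qw(bin hat car pong);
my @targets = (qw(bone carbon xyzzy), 'ping pong') x 1000;

# Variant 1: compile each simple regex once per call, then try them
# in turn against every target, stopping at the first one that matches.
sub count_separate {
    my ($words, $targets) = @_;
    my @res = map { qr/\Q$_\E/ } @$words;
    return scalar grep {
        my $target = $_;
        my $hit    = 0;
        for my $re (@res) {
            if ($target =~ $re) { $hit = 1; last }
        }
        $hit;
    } @$targets;
}

# Variant 2: build one combined pattern per call. Construction is part
# of the timed code, but it happens once, outside the grep.
# ASSUMPTION: a joined alternation stands in for the assembled pattern.
sub count_combined {
    my ($words, $targets) = @_;
    my $alt = join '|', map { quotemeta($_) } @$words;
    my $re  = qr/(?:$alt)/;
    return scalar grep { $_ =~ $re } @$targets;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    separate => sub { count_separate(\@words, \@targets) },
    combined => sub { count_combined(\@words, \@targets) },
});
```

The point of the layout is that both variants pay their setup cost once per set of regexes, inside the timed sub, and neither pays it once per target string.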

If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be 'bone'.

Indeed, if I use /bin/, /bat/, /bar/, /bong/, I get the following results from the benchmark:

        Rate regex    ra
regex 9.90/s    --  -30%
ra    14.2/s   43%    --

This is due to all four patterns starting with the same character. But change 'bong' to 'pong', and it already flips the other way:

        Rate    ra regex
ra    5.25/s    --  -49%
regex 10.2/s   95%    --

So even with three out of four strings starting with the same character, the assembled pattern loses. With /bin/, /hat/, /car/, /pong/, we get:

        Rate    ra regex
ra    4.45/s    --  -56%
regex 10.1/s  126%    --

Increasing it to ten simple regexes, I get:

        Rate    ra regex
ra    2.53/s    --  -48%
regex 4.89/s   93%    --

And going to 20, I get:

        Rate    ra regex
ra    1.74/s    --  -35%
regex 2.69/s   54%    --

(Words used: bin, hat, car, pong, zap, digit, foo, umbrella, apple, cherry, red, blue, white, green, yellow, orange, brown, purple, violet, black.) The gap narrows as the list grows, which suggests that eventually a long list of simple regexes becomes slower than one long, complicated pattern, but that it takes quite a lot of regexes before the assembled pattern comes out ahead.

Now, I'm not claiming that a bunch of simpler regexes is always faster, not at all. All I'm saying is that the trade-off isn't as clear-cut as you presented it.

In reply to Re^4: Regex and question of design by Anonymous Monk
in thread Regex and question of design by amaguk
