comment on

sflitman:

I'd suggest that there could be some "magic" involved. Are you familiar with the Boyer-Moore string search algorithm? I recall being surprised when I originally saw it. A typical index function might be implemented as a typical character-by-character search: e.g. Look for the first character in the target that matches the first character in the string to find, then do a string comparison at each point you find it. It seems pretty fast, and at first glance, it doesn't seem to be possible to improve it by much.

But after you read about the Boyer-Moore algorithm, you see that it's quite possible for you to the search for the string while looking at only a fraction of the string. (In the linked Wikipedia article, you'll see that in their example search, the typical case is to skip eight characters for the next character to examine. In my mind, that nearly qualifies as magic!) With all the work involved in building the state machine for searching, it's a sure bet that they take advantage of all the good tricks to make the search faster.

Now, I haven't evaluated the benchmark you created (I haven't drunk my morning coffee yet), but you might want to play with it a bit to vary the size of your search string, your buffer size, and the percentage of hits to see if you can figure out the "rules" that make one better than the other for (a) your machine, (b) the current version of perl, etc.

Looking at degenerate cases, it's easy to see that the mundane index implementation is going to be fastest with tiny search strings and tiny buffers that you execute only once. After all, in this case there are only a few compares, and there's no overhead of building a state machine. But once you work with very large strings, and you can re-use your state machine, the Boyer-Moore algorithm is obviously better. The overhead of building the state machine is significant, but if you can look at only 10% of your buffer, you can save some serious cycles. But where the dividing line is between them shifts as the technology advances--I have no idea where the border is right now.

...roboticus

In reply to Re: is index faster than regexp for fixed text token? by roboticus
in thread is index faster than regexp for fixed text token? by sflitman

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.