in reply to is index faster than regexp for fixed text token?
I'd suggest that there could be some "magic" involved. Are you familiar with the Boyer-Moore string search algorithm? I recall being surprised when I originally saw it. A typical index function might be implemented as a typical character-by-character search: e.g. Look for the first character in the target that matches the first character in the string to find, then do a string comparison at each point you find it. It seems pretty fast, and at first glance, it doesn't seem to be possible to improve it by much.
But after you read about the Boyer-Moore algorithm, you see that it's quite possible for you to the search for the string while looking at only a fraction of the string. (In the linked Wikipedia article, you'll see that in their example search, the typical case is to skip eight characters for the next character to examine. In my mind, that nearly qualifies as magic!) With all the work involved in building the state machine for searching, it's a sure bet that they take advantage of all the good tricks to make the search faster.
Now, I haven't evaluated the benchmark you created (I haven't drunk my morning coffee yet), but you might want to play with it a bit to vary the size of your search string, your buffer size, and the percentage of hits to see if you can figure out the "rules" that make one better than the other for (a) your machine, (b) the current version of perl, etc.
Looking at degenerate cases, it's easy to see that the mundane index implementation is going to be fastest with tiny search strings and tiny buffers that you execute only once. After all, in this case there are only a few compares, and there's no overhead of building a state machine. But once you work with very large strings, and you can re-use your state machine, the Boyer-Moore algorithm is obviously better. The overhead of building the state machine is significant, but if you can look at only 10% of your buffer, you can save some serious cycles. But where the dividing line is between them shifts as the technology advances--I have no idea where the border is right now.
...roboticus
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: is index faster than regexp for fixed text token?
by sflitman (Hermit) on Jul 05, 2009 at 16:21 UTC | |
by roboticus (Chancellor) on Jul 05, 2009 at 17:31 UTC | |
by Marshall (Canon) on Jul 06, 2009 at 06:27 UTC | |
by sflitman (Hermit) on Jul 11, 2009 at 01:41 UTC | |
by Marshall (Canon) on Jul 13, 2009 at 00:27 UTC |