in reply to is index faster than regexp for fixed text token?

sflitman:

I'd suggest that there could be some "magic" involved. Are you familiar with the Boyer-Moore string search algorithm? I recall being surprised when I originally saw it. A typical index function might be implemented as a typical character-by-character search: e.g. Look for the first character in the target that matches the first character in the string to find, then do a string comparison at each point you find it. It seems pretty fast, and at first glance, it doesn't seem to be possible to improve it by much.

But after you read about the Boyer-Moore algorithm, you see that it's quite possible for you to the search for the string while looking at only a fraction of the string. (In the linked Wikipedia article, you'll see that in their example search, the typical case is to skip eight characters for the next character to examine. In my mind, that nearly qualifies as magic!) With all the work involved in building the state machine for searching, it's a sure bet that they take advantage of all the good tricks to make the search faster.

Now, I haven't evaluated the benchmark you created (I haven't drunk my morning coffee yet), but you might want to play with it a bit to vary the size of your search string, your buffer size, and the percentage of hits to see if you can figure out the "rules" that make one better than the other for (a) your machine, (b) the current version of perl, etc.

Looking at degenerate cases, it's easy to see that the mundane index implementation is going to be fastest with tiny search strings and tiny buffers that you execute only once. After all, in this case there are only a few compares, and there's no overhead of building a state machine. But once you work with very large strings, and you can re-use your state machine, the Boyer-Moore algorithm is obviously better. The overhead of building the state machine is significant, but if you can look at only 10% of your buffer, you can save some serious cycles. But where the dividing line is between them shifts as the technology advances--I have no idea where the border is right now.

...roboticus

Replies are listed 'Best First'.
Re^2: is index faster than regexp for fixed text token?
by sflitman (Hermit) on Jul 05, 2009 at 16:21 UTC
    Excellent thoughts, thank you. The question for me is in squeezing more performance out of a mod_perl web application which needs to check for named options in tiny search strings, so perhaps the mundane index implementation is better there. It is faster for me to type regexp matches, so I'll probably use them more, and regexp matches can have /i so I can make them case-insensitive which with index requires lc on both arguments.

    I studied Boyer-Moore in my first C Algorithms class under the very demanding Dr. Wong. I'll never forget him, he took points off my code for forgetting to comment, and I learned as much as 50% of my programming craft from him. Nearly failed that class! I was such a geek and I signed up for it as a freshman, when he taught it like a graduate course. I got the impression that he didn't like me...but that probably made me study harder! He also showed me the XOR trick for swapping two registers without an intermediary:

    $a^=$b; $b^=$a; $a^=$b; # swap a and b

    SSF

      sflitman:

      I'd definitely update the benchmark to test cases that are close to what you're using in your application, just to make sure you get the best result. Include the lc calls, as well, so you can see if they push you to the regex side.

      Oh, yeah, if you're going to do other matches as well, be sure to see if you can incorporate them into the regex to let it get even more performance for you.

      Regarding your class--I always found it best to take the course with the teachers other students considered "the hardest", as I found I usually learned more from them. And because of their reputation, I'd keep up on the notes & homework. It's amazing how easy a difficult class is if you keep up on the reading and homework.

      ...roboticus

      Update: Added second paragraph.

      "He also showed me the XOR trick for swapping two registers without an intermediary":
      $a^=$b; $b^=$a; $a^=$b; # swap a and b

      This "trick" is part of first level class in 'C'. It is shown to demonstrate power of XOR. It looks cool but it is inefficient and in general not a good idea even though it produces a correct result. XOR is "more expensive", meaning takes longer than other simple bitwise ops like OR or AND or NOT. Anyway this construct is just demonstrated to explain XOR, it is not practical is not used even in 'C'.

      In Perl, this is bad code! - just wrong.

      The Perl way: ($a,$b)=($b,$a);
      the above is practical and could be used.

        I agree; I like the Perl way better because it is more expressive and readable, plus it makes more sense to my XOR-incapable neurons. (Actually, I heard a lecture somewhere that neuron dendrites compute analog boolean-like operations such as AND and NOT on their neural inputs).

        SSF