The basic premise of (the titled) fast string matching algorithms, is that you preprocess the needle and construct a table of shifts or skips that allow you to skip over chunks of the haystack and thus speed things up.

The following, I hope clear, explanation (they are few and far between) describes a variant called Quick Search.

You start with a haystack and needle:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010

You inspect the needle and build the skip table:

00001010 01010100 00011010 00001010 shift 24 01010100 shift 16 00011010 shift 8 xxxxxxxx shift 32 (all other patterns not found in the needle allow ma +ximum shift)

And apply it. Try the pattern at the start of the haystack; doesn't match, so look up the next 8 bits of the haystack in the table, and find the shift = 32:

000011101001110010111011 01110001 111010111100111110111111100000001000 +100100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010 ???????? shift = 32

So, apply the shift and try the needle at the new position. No match, lookup the next 8 bits of the haystack to find a shift of 32:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010???????? shift + = 32

Apply the shift, try the needle at the new position. No match, look up the next 8 bits, get the shift of 8:

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010 +100101010000011010???????? shift = 8

Apply the shift, try the match, And success. Needle found.

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + 000010100101010000011010 Found at 72

Now lets try that with the same haystack but another needle:

10000101 00101010 00001101 10000101 shift 24 00101010 shift 16 00001101 shift 8 xxxxxxxx shift 32 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 100001010010101000001101???????? shift 32 100001010010101000001101???????? shift + 32 100001 +010010101000001101???????? shift 32 + 100001010010101000001101???????? shift 32 + 10000101001 +0101000001101 >>> Not found.

Great! Four compares & four skips to discover the needle isn't in the haystack.

Except that it is!

0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + 100001010010101000001101

And that's why Boyer-Moore, Horspool, Alpha-Skip, Quick Search et. al. don't work for bit-strings.

It doesn't matter what size you make the chunks of bits -- above I used 8 -- the skip table will only ever consider aligned groups of bits, but bit-strings inherently consist entirely of unaligned bits!

(Don't believe me; just try it for yourself.)

And the point of this meditation?

To prove my instincts were right?

Maybe, but mostly because I wanted to be wrong. I wanted there to be a better than brute force mechanism. And I'm still hoping that someone will point me at one.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.