I'll attempt a brief description of the Alpha Skip Search. Hopefully, this may shed some light on the basic operation of the algorithm.

Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A perl implementation might read as follows:

push @{$index{ substr($needle, $_, $F) }}, $_     for 0 .. $M-$F;

Search phase proceeds by inspecting the haystack one window at a time. Window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index now yields all possible alignments where this fragment will match in the pattern.

Theoretical optimum for F is logsM (log based on alphabet size). That amounts to one entry per index bucket, on the average. For random data, there will be one alignment to test per window, and this is likely to fail at the very start.

In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of trie). Here's a sample implementation in C (just a rehash of the previous demo, main difference being this one handles variable length needles).

Tested with gcc/clang under linux; my apologies if the code is too obscure, or otherwise unworkable.

Update. Simplified above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings. Minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.


In reply to Re^2: [OT] The interesting problem of ... bit-strings (alpha skip search) by oiskuu
in thread [OT] The interesting problem of comparing bit-strings. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.