Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

I had trouble understanding your shift values, but that's because you're looking at the next byte instead of trying to gain information from the current position of the needle. For me 00011010 would mean: 0 bits shift, check on the left of the string for the first mismatch.

Anyway, your mismatch table is wrong: if you have 00001101 that does not mean a 32 bits shift, that's a 9 bits shift, because this sequence of bit is present in your needle. And 00000000 is a 28 bits shift, because its rightmost 4 bits match the beginning of the needle.

Edit: I implemented a subset of Boyer-Moore for a string of bits below. Except since I was lazy, I used string of '0' and '1', so I realized that's nearly just the "normal" Boyer-Moore algorithm that works on characters. So the implementation has no further interest than the description of the algorithm on wikipedia. The point though is that instead of having a table of shifts required for a given byte, you can have a table of shifts required for the position of the rightmost mismatch. This is a little less efficient for small needles, but far easier to compute, and still better than brute force.

For longer needles, you have either a table of 2**N (with N the length of the suffix you want to check) possible values and the shift required, or a table that can easily be of the same length as the needle, that holds a shift value for each possible mismatch position (using the rightmost one).

Here is an implementation using only the bad-char rule. In C I suppose using the bitwise XOR operator could be used to find the rightmost mismatch between two bitstrings. You can either provide two strings as parameters, or let the script generate a haystack and extract a needle.

use v5.14;
use Data::Dump qw/pp/;

# In all this code the position 0 of an array of bits
# holds the rightmost bit in the string

# Finds the lowest position of a mismatch, or returns -1
sub match(++)
{
    for (0..$#{$_[0]})
    {
        return $_ if $_[0][$_] ne $_[1][$_];
    }
    return -1;
}

# Arrays of reversed bits ($haystack[0] is the rightmost bit of the bi
+tstring)
my @haystack =  @ARGV ? reverse split //, shift : map { int rand 2 } 1
+..80;

# Unless the strings were specified by the user, take needle as part o
+f the haystack
my $right = int(15 + rand 30);
my $len   = int(15 + rand 20); 
my @needle =    @ARGV ? reverse split //, shift : @haystack[$right..$r
+ight+$len];


# $pos[0] will contain the positions of 0 bits, and $pos[1] the positi
+ons of 1 bits 
my @pos = ([], []);

for (0..$#needle)
{
    push $pos[$needle[$_]], $_;
}

# Allow the bit beyond the length of needle to be either 1 or 0
push $pos[0], scalar @needle;
push $pos[1], scalar @needle;


pp @pos;

# If we are expecting bit B at position N but got !B
# the next occurence of !B in the needle will be
# $shift[N] bits to the left
my @shift;

for (0..$#needle) # For each bit in needle (from the right)
{
    my $bit = !$needle[$_];

    # Remove from @pos all reverse bits with a lower position
    shift $pos[$bit] while $pos[$bit][0] <= $_;

    # Push difference between the position of the next matching
    # bit and the position of the mismatch
    push @shift, $pos[$bit][0]-$_;
}

say "$_ => $shift[$_]" for 0..$#shift;

# Start with needle at the left
my $position = @haystack - @needle;

# If there is no match, a bruteforce solution will go through every
# position from the start
my $bruteforce = $position + 1;

my $needle = reverse @needle;
my $haystack = reverse @haystack;
my $steps = 0;

say "Needle should match at:", $position - $-[0] if $haystack =~ /$nee
+dle/;

while ($position >= 0)
{
    $steps++;
    my @substring = @haystack[$position..$position+$#needle];
    my $match = match @substring, @needle;
    last if $match < 0;
    $position -= $shift[$match]; 
}

say $haystack;
if ($position >= 0)
{
    say ' ' x (@haystack - @needle - $position), $needle;
    say "Match found at pos: $position";
    $bruteforce -= $position;
}
else
{
    say "No match found";
}

say "Steps needed: $steps";
say "Steps required for bruteforce: $bruteforce";
[download]

The longer the sequences of identical bits, the better this algorythm will do compared with bruteforce. Using the good suffix rule of the BM algorythm would avoid the worst case scenario with a needle like 01010101.

Comment on Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) Download Code

Replies are listed 'Best First'.
Re^2: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by BrowserUk (Patriarch) on Apr 05, 2015 at 17:37 UTC
The point though is that instead of having a table of shifts required for a given byte, you can have a table of shifts required for the position of the rightmost mismatch. The right mismatching what? Single bit? Group of 8 bits? Or 13 bits? 1000001 bits? My point is that a table of single bits 0 or 1 doesn't help; but any larger than that, and you're into the "groups of aligned bits" problem I described above. Whatever alignment you choose, byte/word/other; and whatever group of bits from the mismatched position in the haystack you use as your index, shift the needle 1-bit either way from that alignment and that group is no longer relevant. I don't know how to better describe it; but until you've tried to implement what you are describing -- using bits; not bytes pretending to be bits -- you will not appreciate the problem. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply]
Re^3: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by Anonymous Monk on Apr 06, 2015 at 00:37 UTC
How about making the bits pretend to be bytes? =)	[reply]
Re^4: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by BrowserUk (Patriarch) on Apr 06, 2015 at 01:14 UTC
How about making the bits pretend to be bytes? Apart from the fact that a 1GB binary bitstring that fits comfortably in memory would suddenly occupy 8GB and push my machine into swapping; what makes you think searching 8GB of ascii-ized binary would be faster than searching 1GB of binary binary? Okay, so you'd be able to use one or other of the fast string search algorithms, but the trouble with that is, with only 2 x 8-bit patterns involved -- ie. a 2 symbol alphabet -- they simply no longer provide an benefit. So your back to using a brute-force search algorithm; but on 8 x the volume of data. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked	[reply]
Re^5: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by Anonymous Monk on Apr 06, 2015 at 06:30 UTC
Re^6: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?) by Anonymous Monk on Apr 06, 2015 at 06:31 UTC