I had trouble understanding your shift values, but that's because you're looking at the next byte instead of trying to gain information from the current position of the needle. For me 00011010 would mean: 0 bits shift, check on the left of the string for the first mismatch.
Anyway, your mismatch table is wrong: if you have 00001101 that does not mean a 32 bits shift, that's a 9 bits shift, because this sequence of bit is present in your needle. And 00000000 is a 28 bits shift, because its rightmost 4 bits match the beginning of the needle.
Edit: I implemented a subset of Boyer-Moore for a string of bits below. Except since I was lazy, I used string of '0' and '1', so I realized that's nearly just the "normal" Boyer-Moore algorithm that works on characters. So the implementation has no further interest than the description of the algorithm on wikipedia. The point though is that instead of having a table of shifts required for a given byte, you can have a table of shifts required for the position of the rightmost mismatch. This is a little less efficient for small needles, but far easier to compute, and still better than brute force.
For longer needles, you have either a table of 2**N (with N the length of the suffix you want to check) possible values and the shift required, or a table that can easily be of the same length as the needle, that holds a shift value for each possible mismatch position (using the rightmost one).
Here is an implementation using only the bad-char rule. In C I suppose using the bitwise XOR operator could be used to find the rightmost mismatch between two bitstrings. You can either provide two strings as parameters, or let the script generate a haystack and extract a needle.
use v5.14; use Data::Dump qw/pp/; # In all this code the position 0 of an array of bits # holds the rightmost bit in the string # Finds the lowest position of a mismatch, or returns -1 sub match(++) { for (0..$#{$_[0]}) { return $_ if $_[0][$_] ne $_[1][$_]; } return -1; } # Arrays of reversed bits ($haystack[0] is the rightmost bit of the bi +tstring) my @haystack = @ARGV ? reverse split //, shift : map { int rand 2 } 1 +..80; # Unless the strings were specified by the user, take needle as part o +f the haystack my $right = int(15 + rand 30); my $len = int(15 + rand 20); my @needle = @ARGV ? reverse split //, shift : @haystack[$right..$r +ight+$len]; # $pos[0] will contain the positions of 0 bits, and $pos[1] the positi +ons of 1 bits my @pos = ([], []); for (0..$#needle) { push $pos[$needle[$_]], $_; } # Allow the bit beyond the length of needle to be either 1 or 0 push $pos[0], scalar @needle; push $pos[1], scalar @needle; pp @pos; # If we are expecting bit B at position N but got !B # the next occurence of !B in the needle will be # $shift[N] bits to the left my @shift; for (0..$#needle) # For each bit in needle (from the right) { my $bit = !$needle[$_]; # Remove from @pos all reverse bits with a lower position shift $pos[$bit] while $pos[$bit][0] <= $_; # Push difference between the position of the next matching # bit and the position of the mismatch push @shift, $pos[$bit][0]-$_; } say "$_ => $shift[$_]" for 0..$#shift; # Start with needle at the left my $position = @haystack - @needle; # If there is no match, a bruteforce solution will go through every # position from the start my $bruteforce = $position + 1; my $needle = reverse @needle; my $haystack = reverse @haystack; my $steps = 0; say "Needle should match at:", $position - $-[0] if $haystack =~ /$nee +dle/; while ($position >= 0) { $steps++; my @substring = @haystack[$position..$position+$#needle]; my $match = match @substring, @needle; last if $match < 0; $position -= $shift[$match]; } say $haystack; if ($position >= 0) { say ' ' x (@haystack - @needle - $position), $needle; say "Match found at pos: $position"; $bruteforce -= $position; } else { say "No match found"; } say "Steps needed: $steps"; say "Steps required for bruteforce: $bruteforce";
The longer the sequences of identical bits, the better this algorythm will do compared with bruteforce. Using the good suffix rule of the BM algorythm would avoid the worst case scenario with a needle like 01010101.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by BrowserUk (Patriarch) on Apr 05, 2015 at 17:37 UTC | |
by Anonymous Monk on Apr 06, 2015 at 00:37 UTC | |
by BrowserUk (Patriarch) on Apr 06, 2015 at 01:14 UTC | |
by Anonymous Monk on Apr 06, 2015 at 06:30 UTC | |
by Anonymous Monk on Apr 06, 2015 at 06:31 UTC |