in reply to Re^2: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
in thread Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

BrowserUk:

Update 2: I couldn't leave it alone (damned OCD), so I think it works now. I definitely had a few code problems, and the new version looks pretty decent to me. (The variable names could be better, but I'm a bit too tired to mess with it.

It looks like one of my previous editing passes removed the reason that I coded it in Python. At $work, I'm having to use Python and JavaScript to write some stuff. I chose the lesser of two evals.

(Believed) correct code:

#!/usr/bin/env python import pprint import random import sys # Set up your needle, haystack and alphabet size haystack = "0000111010011100101110110111000111101011110011111011111110 +000000100010010000101001010100000110100011001001100010110101011001101 +1000000" hay_len = len(haystack) ruler = "" cases = [ { 'NDL': "100001010010101000001101", 'syms':3, 'sz':8 }, { 'NDL': "100001010010101000001101", 'syms':4, 'sz':6 }, { 'NDL': "100001010010101000001101", 'syms':6, 'sz':4 }, { 'NDL': "100001010010101000001101", 'syms':8, 'sz':3 }, { 'NDL': "000010100101010000011010", 'syms':3, 'sz':8 }, { 'NDL': "000010100101010000011010", 'syms':4, 'sz':6 }, { 'NDL': "000010100101010000011010", 'syms':6, 'sz':4 }, { 'NDL': "000010100101010000011010", 'syms':8, 'sz':3 }, { 'NDL': "1000010100101010000011010", 'syms':5, 'sz':5 } ] case = 0 try: case = int(sys.argv[1]) except: pass needle = cases[case]['NDL'] num_syms = cases[case]['syms'] symbol_size = cases[case]['sz'] def main(): global ruler print "\n{0}\nHaystack: {1}\n Needle: {2}\n".format(syms(70, '-') +, haystack, needle) # Use brute force to find all matches positions = brute_force(needle, haystack) if len(positions)>0: ruler = "".join([ "V" if x in positions else " " for x in rang +e(0, len(haystack)) ]) print "Brute force:\n{0}\n{1}".format(ruler, haystack) print "{0}{1}\n".format(spaces(positions[0]), needle) else: print "Brute force: not found" if not find_needle(haystack, needle, num_syms, symbol_size): print "**** Case failed! ****" def find_needle(haystack, needle, num_syms, symbol_size): sym_marker = syms(symbol_size, '^') needle_syms = [ needle[i*symbol_size:(i+1)*symbol_size] for i in r +ange(0, num_syms)] max_shift = (num_syms-1)*symbol_size print "Modified BM search: {0} syms of {1} bits: {2}".format( num_syms, symbol_size, ", ".join(needle_syms)) # Compute delta table dd1 = { } for i in reversed(range(0, (1<<symbol_size))): sym = to_binstr(i, symbol_size) tmp = brute_force(sym, needle) if len(tmp)>0: bitofs = tmp[-1] shift = max_shift - bitofs ttt = ", ".join(map(str,tmp)) dd1[sym] = shift #bitofs # Uncomment to see the delta table #print pprint.pprint(dd1) # Search loop print "Searching for a needle in the haystack" needle_ptr = 0 while needle_ptr + len(needle) <= len(haystack): cmp_ptr = needle_ptr + max_shift found = haystack[cmp_ptr:cmp_ptr+symbol_size] print "\n{0}\n{1}".format(ruler, haystack) print "{0}{2} (pos={1})".format(spaces(cmp_ptr), cmp_ptr, sym_ +marker) print "{0}{1}".format(spaces(cmp_ptr), found) print "{0}{1}".format(spaces(needle_ptr), needle) if found in dd1: if dd1[found] == 0: fl_match = True for t in range(0, num_syms): if needle_syms[t] != haystack[needle_ptr + t*symbo +l_size:needle_ptr + (t+1)*symbol_size]: fl_match = False if fl_match: print "MATCH FOUND!" print "\n{0}\n{1}\n{2}{3}\n".format(haystack,ruler +,spaces(needle_ptr),needle) return True else: print "Byte matched, string mismatch, trying next +position" needle_ptr += 1 else: print "found {0}, delta=>{1}".format(found, dd1[found] +) needle_ptr += dd1[found] else: print "not found, big shift" needle_ptr += len(needle)-symbol_size return False def to_binstr(n,len): bv = [ "1" if n & (1<<x) else "0" for x in range(0, len)] return "".join(bv) def brute_force(needle, haystack): """Brute-force find all matching positions of needle in haystack"" +" positions = [] for i in range(0, len(haystack)-len(needle)+1): if needle == haystack[i:i+len(needle)]: positions.append(i) return positions def syms(n, sym): return "".join([ sym for x in range(0, n)]) def spaces(n): return syms(n, ' ') if __name__ == '__main__': main()

I should probably recode it in perl at some point. Old code & such below in readmore tags, should you be interested.

(BTW: pun was intended.)

Update: The code has a bug that I haven't found in it. (At least, I believe that it's a code issue rather than my ideas, but time will tell...I'll try to fix it tomorrow. Not content to leave well enough alone, I was playing around and changed the symbol size to 6 and number of symbols to 4, and it missed entirely, but it found 4 and 6.)


With the minimum shift distance being a bit and the symbol size as something larger (eight bits in the example), you have to make an adjustment to the BM delta 1 table computation by computing the shift distance for each position inside the needle.

At the ends of the needle, you have arbitrary bit patterns coming in, which forces you choose a strategy:

I'm kinda tired, so I'll leave it here today. If you have any questions about the code, let me know and I'll answer 'em tomorrow.

#!/usr/bin/env python import random import pprint def to_binstr(n,len): bv = [ "1" if n & (1<<x) else "0" for x in range(0, len)] return "".join(bv) def brute_force(needle, haystack): """Brute-force find all matching positions of needle in haystack"" +" positions = [] for i in range(0, len(haystack)-len(needle)+1): if needle == haystack[i:i+len(needle)]: positions.append(i) return positions def spaces(n): return "".join([ " " for x in range(0, n)]) haystack = "0000111010011100101110110111000111101011110011111011111110 +000000100010010000101001010100000110100011001001100010110101011001101 +1000000" needle1 = + "000010100101010000011010" needle = + "100001010010101000001101" hay_len = len(haystack) symbol_size = 8 # each symbol is 3 bits long num_syms = 3 # length of needle in "symbols" needle_len = num_syms * symbol_size # our needle is two symbols long # Use brute force to find all matches positions = brute_force(needle, haystack) ruler = "".join([ "^" if x in positions else " " for x in range(0, len +(haystack)) ]) print haystack print ruler for ofs in positions: print "{0}{1}".format(spaces(ofs),needle) print " " # Find the locations of all symbol matches for selected symbol size delta = [ -1 for i in range(0, (1<<symbol_size)) ] dd1 = { } for i in range(0, (1<<symbol_size)): sym = to_binstr(i, symbol_size) tmp = brute_force(sym, needle) if len(tmp)>0: bitofs = tmp[-1] delta[i] = bitofs dd1[sym] = bitofs print "NEEDLE: ", needle needle_syms = [ needle[i*symbol_size:(i+1)*symbol_size] for i in range +(0, num_syms)] # Search loop ptr = len(needle) - symbol_size while ptr < len(haystack): print haystack print ruler print "{0}^ (pos={1})".format(spaces(ptr-1), ptr) found = haystack[ptr - symbol_size:ptr] print "{0}{1}".format(spaces(ptr-symbol_size), found) print "{0}{1}".format(spaces(ptr-len(needle)), needle) if found in dd1: print "found {0}, delta=>{1}".format(found, dd1[found]) if dd1[found] == (num_syms-1)*symbol_size: print "MATCHED!" break elif dd1[found]>0: print "shifting" ptr += dd1[found] else: print "WTF?" break else: print "not found, big shift" ptr += len(needle)-symbol_size

When I run it with the second case provided, it gives:

$ python boyer_moore_bitstr.py 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + 100001010010101000001101 NEEDLE: 100001010010101000001101 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ ^ (pos=16) 10011100 100001010010101000001101 not found, big shift 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ ^ (pos=32) 01110001 100001010010101000001101 not found, big shift 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ ^ (pos=48) 11001111 100001010010101000001101 not found, big shift 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ ^ (pos= +64) 10000000 100001010010101000001101 not found, big shift 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + ^ (pos=80) + 00001010 10000101001010 +1000001101 found 00001010, delta=>1 shifting 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + ^ (pos=81) + 00010100 1000010100101 +01000001101 found 00010100, delta=>2 shifting 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + ^ (pos=83) + 01010010 10000101001 +0101000001101 found 01010010, delta=>4 shifting 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + ^ (pos=87) + 00101010 1000010 +10010101000001101 found 00101010, delta=>8 shifting 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 + ^ + ^ (pos=95) + 00001101 + 100001010010101000001101 found 00001101, delta=>16 MATCHED! Roboticus@Waubli ~ $

Sorry for the line wrapping....

...roboticus

When your only tool is a hammer, all problems look like your thumb.

  • Comment on Re^3: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
  • Select or Download Code