in reply to Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
in thread Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil.

Exactly! (Glad that someone, even an anonymous someone :) also sees the problem.

But the algorithms can be augmented to handle multiple "characters" at a time, and this is when they become lucrative.

Clue sticks and references cordially invited?

(Though my immediate reaction is that compound characters simply reduce to a different alphabet with more bits -- ie. (ab) becomes a single symbol that requires twice as many bits -- and the problem remains the same, or is compounded because the lookup table(s) need to be bigger.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
  • Comment on Re^2: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

Replies are listed 'Best First'.
Re^3: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by roboticus (Chancellor) on Apr 05, 2015 at 19:58 UTC

    BrowserUk:

    Update 2: I couldn't leave it alone (damned OCD), so I think it works now. I definitely had a few code problems, and the new version looks pretty decent to me. (The variable names could be better, but I'm a bit too tired to mess with it.

    It looks like one of my previous editing passes removed the reason that I coded it in Python. At $work, I'm having to use Python and JavaScript to write some stuff. I chose the lesser of two evals.

    (Believed) correct code:

    #!/usr/bin/env python import pprint import random import sys # Set up your needle, haystack and alphabet size haystack = "0000111010011100101110110111000111101011110011111011111110 +000000100010010000101001010100000110100011001001100010110101011001101 +1000000" hay_len = len(haystack) ruler = "" cases = [ { 'NDL': "100001010010101000001101", 'syms':3, 'sz':8 }, { 'NDL': "100001010010101000001101", 'syms':4, 'sz':6 }, { 'NDL': "100001010010101000001101", 'syms':6, 'sz':4 }, { 'NDL': "100001010010101000001101", 'syms':8, 'sz':3 }, { 'NDL': "000010100101010000011010", 'syms':3, 'sz':8 }, { 'NDL': "000010100101010000011010", 'syms':4, 'sz':6 }, { 'NDL': "000010100101010000011010", 'syms':6, 'sz':4 }, { 'NDL': "000010100101010000011010", 'syms':8, 'sz':3 }, { 'NDL': "1000010100101010000011010", 'syms':5, 'sz':5 } ] case = 0 try: case = int(sys.argv[1]) except: pass needle = cases[case]['NDL'] num_syms = cases[case]['syms'] symbol_size = cases[case]['sz'] def main(): global ruler print "\n{0}\nHaystack: {1}\n Needle: {2}\n".format(syms(70, '-') +, haystack, needle) # Use brute force to find all matches positions = brute_force(needle, haystack) if len(positions)>0: ruler = "".join([ "V" if x in positions else " " for x in rang +e(0, len(haystack)) ]) print "Brute force:\n{0}\n{1}".format(ruler, haystack) print "{0}{1}\n".format(spaces(positions[0]), needle) else: print "Brute force: not found" if not find_needle(haystack, needle, num_syms, symbol_size): print "**** Case failed! ****" def find_needle(haystack, needle, num_syms, symbol_size): sym_marker = syms(symbol_size, '^') needle_syms = [ needle[i*symbol_size:(i+1)*symbol_size] for i in r +ange(0, num_syms)] max_shift = (num_syms-1)*symbol_size print "Modified BM search: {0} syms of {1} bits: {2}".format( num_syms, symbol_size, ", ".join(needle_syms)) # Compute delta table dd1 = { } for i in reversed(range(0, (1<<symbol_size))): sym = to_binstr(i, symbol_size) tmp = brute_force(sym, needle) if len(tmp)>0: bitofs = tmp[-1] shift = max_shift - bitofs ttt = ", ".join(map(str,tmp)) dd1[sym] = shift #bitofs # Uncomment to see the delta table #print pprint.pprint(dd1) # Search loop print "Searching for a needle in the haystack" needle_ptr = 0 while needle_ptr + len(needle) <= len(haystack): cmp_ptr = needle_ptr + max_shift found = haystack[cmp_ptr:cmp_ptr+symbol_size] print "\n{0}\n{1}".format(ruler, haystack) print "{0}{2} (pos={1})".format(spaces(cmp_ptr), cmp_ptr, sym_ +marker) print "{0}{1}".format(spaces(cmp_ptr), found) print "{0}{1}".format(spaces(needle_ptr), needle) if found in dd1: if dd1[found] == 0: fl_match = True for t in range(0, num_syms): if needle_syms[t] != haystack[needle_ptr + t*symbo +l_size:needle_ptr + (t+1)*symbol_size]: fl_match = False if fl_match: print "MATCH FOUND!" print "\n{0}\n{1}\n{2}{3}\n".format(haystack,ruler +,spaces(needle_ptr),needle) return True else: print "Byte matched, string mismatch, trying next +position" needle_ptr += 1 else: print "found {0}, delta=>{1}".format(found, dd1[found] +) needle_ptr += dd1[found] else: print "not found, big shift" needle_ptr += len(needle)-symbol_size return False def to_binstr(n,len): bv = [ "1" if n & (1<<x) else "0" for x in range(0, len)] return "".join(bv) def brute_force(needle, haystack): """Brute-force find all matching positions of needle in haystack"" +" positions = [] for i in range(0, len(haystack)-len(needle)+1): if needle == haystack[i:i+len(needle)]: positions.append(i) return positions def syms(n, sym): return "".join([ sym for x in range(0, n)]) def spaces(n): return syms(n, ' ') if __name__ == '__main__': main()

    I should probably recode it in perl at some point. Old code & such below in readmore tags, should you be interested.

    (BTW: pun was intended.)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^3: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by Anonymous Monk on Apr 05, 2015 at 18:13 UTC

    Given the needle:

    00001010 01010100 00011010
    
    The bad shift table for handling eight bits at a time:
    00001010                        needle offset 0
     0001010 0                      needle offset 1
      001010 01                     needle offset 2
       01010 010                    needle offset 3
        1010 0101                   needle offset 4
         010 01010                  needle offset 5
          10 010101                 needle offset 6
           0 0101010                needle offset 7
             01010100               needle offset 8
              1010100 0             needle offset 9
               010100 00            needle offset 10
                10100 000           needle offset 11
                 0100 0001          needle offset 12
                  100 00011         needle offset 13
                   00 000110        needle offset 14
                    0 0001101       needle offset 15
                      00011010      needle offset 16
    
    The shift follows from needle offset. If some 8-bit value repeats, the smallest shift is used.

      Okay? So, we have your table, the original haystack and the needle you used:

      00000110 needle offset 14 00001010 needle offset 0 00001101 needle offset 15 00010100 needle offset 1 00011010 needle offset 16 00101001 needle offset 2 00101010 needle offset 7 01000001 needle offset 12 01001010 needle offset 5 01010000 needle offset 10 01010010 needle offset 3 01010100 needle offset 8 10000011 needle offset 13 10010101 needle offset 6 10100000 needle offset 11 10100101 needle offset 4 10101000 needle offset 9 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010

      We've compared the needle at the first position of the haystack and it doesn't match. Now what?

      1. Which group of 8 bits from the haystack do we lookup in the table?
      2. And how do we translate the offset at which that pattern was found in the needle; into a number of bits to shift the needle?

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        The first group to inspect is at position (needle_len - 8): 10111011. That does not appear in the needle, so apply the maximum shift, which is 17 (needle_len - 8 + 1). Were it present in the needle, then the shift would've been (needle_len - 8 - offset), followed by compare.