in reply to Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

The likely cause of misapprehension here is the persistent use of "character" in any description of a string search algorithm one is likely to come across. Which is easy to confuse with a character type, or byte, as programmers might call it. A symbol might be more appropriate, because those algorithms operate on abstract alphabets.

If one were to implement unmodified B-M on bitstrings, the "characters" would be individual bits. The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil. But the algorithms can be augmented to handle multiple "characters" at a time, and this is when they become lucrative.

  • Comment on Re: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

Replies are listed 'Best First'.
Re^2: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by BrowserUk (Patriarch) on Apr 05, 2015 at 17:43 UTC
    The bad character shift table would have two slots, for '0' and for '1'. Practical value of such arrangement is nil.

    Exactly! (Glad that someone, even an anonymous someone :) also sees the problem.

    But the algorithms can be augmented to handle multiple "characters" at a time, and this is when they become lucrative.

    Clue sticks and references cordially invited?

    (Though my immediate reaction is that compound characters simply reduce to a different alphabet with more bits -- ie. (ab) becomes a single symbol that requires twice as many bits -- and the problem remains the same, or is compounded because the lookup table(s) need to be bigger.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      BrowserUk:

      Update 2: I couldn't leave it alone (damned OCD), so I think it works now. I definitely had a few code problems, and the new version looks pretty decent to me. (The variable names could be better, but I'm a bit too tired to mess with it.

      It looks like one of my previous editing passes removed the reason that I coded it in Python. At $work, I'm having to use Python and JavaScript to write some stuff. I chose the lesser of two evals.

      (Believed) correct code:

      #!/usr/bin/env python import pprint import random import sys # Set up your needle, haystack and alphabet size haystack = "0000111010011100101110110111000111101011110011111011111110 +000000100010010000101001010100000110100011001001100010110101011001101 +1000000" hay_len = len(haystack) ruler = "" cases = [ { 'NDL': "100001010010101000001101", 'syms':3, 'sz':8 }, { 'NDL': "100001010010101000001101", 'syms':4, 'sz':6 }, { 'NDL': "100001010010101000001101", 'syms':6, 'sz':4 }, { 'NDL': "100001010010101000001101", 'syms':8, 'sz':3 }, { 'NDL': "000010100101010000011010", 'syms':3, 'sz':8 }, { 'NDL': "000010100101010000011010", 'syms':4, 'sz':6 }, { 'NDL': "000010100101010000011010", 'syms':6, 'sz':4 }, { 'NDL': "000010100101010000011010", 'syms':8, 'sz':3 }, { 'NDL': "1000010100101010000011010", 'syms':5, 'sz':5 } ] case = 0 try: case = int(sys.argv[1]) except: pass needle = cases[case]['NDL'] num_syms = cases[case]['syms'] symbol_size = cases[case]['sz'] def main(): global ruler print "\n{0}\nHaystack: {1}\n Needle: {2}\n".format(syms(70, '-') +, haystack, needle) # Use brute force to find all matches positions = brute_force(needle, haystack) if len(positions)>0: ruler = "".join([ "V" if x in positions else " " for x in rang +e(0, len(haystack)) ]) print "Brute force:\n{0}\n{1}".format(ruler, haystack) print "{0}{1}\n".format(spaces(positions[0]), needle) else: print "Brute force: not found" if not find_needle(haystack, needle, num_syms, symbol_size): print "**** Case failed! ****" def find_needle(haystack, needle, num_syms, symbol_size): sym_marker = syms(symbol_size, '^') needle_syms = [ needle[i*symbol_size:(i+1)*symbol_size] for i in r +ange(0, num_syms)] max_shift = (num_syms-1)*symbol_size print "Modified BM search: {0} syms of {1} bits: {2}".format( num_syms, symbol_size, ", ".join(needle_syms)) # Compute delta table dd1 = { } for i in reversed(range(0, (1<<symbol_size))): sym = to_binstr(i, symbol_size) tmp = brute_force(sym, needle) if len(tmp)>0: bitofs = tmp[-1] shift = max_shift - bitofs ttt = ", ".join(map(str,tmp)) dd1[sym] = shift #bitofs # Uncomment to see the delta table #print pprint.pprint(dd1) # Search loop print "Searching for a needle in the haystack" needle_ptr = 0 while needle_ptr + len(needle) <= len(haystack): cmp_ptr = needle_ptr + max_shift found = haystack[cmp_ptr:cmp_ptr+symbol_size] print "\n{0}\n{1}".format(ruler, haystack) print "{0}{2} (pos={1})".format(spaces(cmp_ptr), cmp_ptr, sym_ +marker) print "{0}{1}".format(spaces(cmp_ptr), found) print "{0}{1}".format(spaces(needle_ptr), needle) if found in dd1: if dd1[found] == 0: fl_match = True for t in range(0, num_syms): if needle_syms[t] != haystack[needle_ptr + t*symbo +l_size:needle_ptr + (t+1)*symbol_size]: fl_match = False if fl_match: print "MATCH FOUND!" print "\n{0}\n{1}\n{2}{3}\n".format(haystack,ruler +,spaces(needle_ptr),needle) return True else: print "Byte matched, string mismatch, trying next +position" needle_ptr += 1 else: print "found {0}, delta=>{1}".format(found, dd1[found] +) needle_ptr += dd1[found] else: print "not found, big shift" needle_ptr += len(needle)-symbol_size return False def to_binstr(n,len): bv = [ "1" if n & (1<<x) else "0" for x in range(0, len)] return "".join(bv) def brute_force(needle, haystack): """Brute-force find all matching positions of needle in haystack"" +" positions = [] for i in range(0, len(haystack)-len(needle)+1): if needle == haystack[i:i+len(needle)]: positions.append(i) return positions def syms(n, sym): return "".join([ sym for x in range(0, n)]) def spaces(n): return syms(n, ' ') if __name__ == '__main__': main()

      I should probably recode it in perl at some point. Old code & such below in readmore tags, should you be interested.

      (BTW: pun was intended.)

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      Given the needle:

      00001010 01010100 00011010
      
      The bad shift table for handling eight bits at a time:
      00001010                        needle offset 0
       0001010 0                      needle offset 1
        001010 01                     needle offset 2
         01010 010                    needle offset 3
          1010 0101                   needle offset 4
           010 01010                  needle offset 5
            10 010101                 needle offset 6
             0 0101010                needle offset 7
               01010100               needle offset 8
                1010100 0             needle offset 9
                 010100 00            needle offset 10
                  10100 000           needle offset 11
                   0100 0001          needle offset 12
                    100 00011         needle offset 13
                     00 000110        needle offset 14
                      0 0001101       needle offset 15
                        00011010      needle offset 16
      
      The shift follows from needle offset. If some 8-bit value repeats, the smallest shift is used.

        Okay? So, we have your table, the original haystack and the needle you used:

        00000110 needle offset 14 00001010 needle offset 0 00001101 needle offset 15 00010100 needle offset 1 00011010 needle offset 16 00101001 needle offset 2 00101010 needle offset 7 01000001 needle offset 12 01001010 needle offset 5 01010000 needle offset 10 01010010 needle offset 3 01010100 needle offset 8 10000011 needle offset 13 10010101 needle offset 6 10100000 needle offset 11 10100101 needle offset 4 10101000 needle offset 9 0000111010011100101110110111000111101011110011111011111110000000100010 +0100001010010101000001101000110010011000101101010110011011000000 000010100101010000011010

        We've compared the needle at the first position of the haystack and it doesn't match. Now what?

        1. Which group of 8 bits from the haystack do we lookup in the table?
        2. And how do we translate the offset at which that pattern was found in the needle; into a number of bits to shift the needle?

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked