in reply to Re^6: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
in thread Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

In table +12 => Shift by 4 (24-8-12).

  • Comment on Re^7: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)

Replies are listed 'Best First'.
Re^8: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (Thank you Anonymonk + others.)
by BrowserUk (Patriarch) on Apr 08, 2015 at 05:32 UTC

    It required one small but important modification to your algorithm; that was a bitch to track down.

    The table calculation: table[ idx ] = needleLen - 8 + needleOffset; can result in a value of 0 (ie. 24 - 8 - 16 = 0 ), which means that the algorithm stalls.

    My current fix is to simply test for and replace any 0 table entries with 1 so it keeps moving forward. (I suspect that there is a better fix, but its not leaping off the page at me; and its time to sleep.)

    The upshot is that a brute force search that takes 9.72 seconds (a 220 bits needle in a 231 bits haystack, found at an offset of 2e9) takes:

    1. 8-bit table: 2.807604e-002 secs.
    2. 12-bit table: 7.613071e-003 secs.
    3. 16-bit table: 4.428185e-003 secs.

    Over 3 orders of magnitude speed up is all together a very successful outcome. Thank you anonymonk.

    Thanks also to salva & oiskuu who's persistent belief that a fast-bit-search was possible, led me to start this thread and get this solution.

    Here is a table of timings; bitstrstr1 is the brute-force; 3 is the 8-bit; 4 is the 16-bit; and 5 the 12-bit.

    The first entry (-1) is a full search for a non-existent needle:

    C:\test\C>bitstrstr "DeBruijn(2^31).bin" 1048576 1 250000000 v:00000000008E0040 fsize:0000000010000004 vsize:2000000 bitstrstr1: 0 : -1 took:1.030802e+001 secs bitstrstr1: 250000000 : 250000000 took:1.274025e+000 secs bitstrstr1: 500000000 : 500000000 took:2.420335e+000 secs bitstrstr1: 750000000 : 750000000 took:3.604043e+000 secs bitstrstr1: 1000000000 : 1000000000 took:4.826112e+000 secs bitstrstr1: 1250000000 : 1250000000 took:6.205683e+000 secs bitstrstr1: 1500000000 : 1500000000 took:7.483219e+000 secs bitstrstr1: 1750000000 : 1750000000 took:8.526418e+000 secs bitstrstr1: 2000000000 : 2000000000 took:9.723123e+000 secs bitstrstr3: 0 : -1 took:1.019202e-001 secs bitstrstr3: 250000000 : 250000000 took:3.729577e-002 secs bitstrstr3: 500000000 : 500000000 took:3.556026e-002 secs bitstrstr3: 750000000 : 750000000 took:1.154790e-002 secs bitstrstr3: 1000000000 : 1000000000 took:1.110663e-002 secs bitstrstr3: 1250000000 : 1250000000 took:1.219347e-002 secs bitstrstr3: 1500000000 : 1500000000 took:1.436291e-002 secs bitstrstr3: 1750000000 : 1750000000 took:2.335553e-002 secs bitstrstr3: 2000000000 : 2000000000 took:2.807604e-002 secs bitstrstr4: 0 : -1 took:8.661798e-003 secs bitstrstr4: 250000000 : 250000000 took:4.236888e-003 secs bitstrstr4: 500000000 : 500000000 took:4.733442e-003 secs bitstrstr4: 750000000 : 750000000 took:4.514161e-003 secs bitstrstr4: 1000000000 : 1000000000 took:4.211377e-003 secs bitstrstr4: 1250000000 : 1250000000 took:4.328529e-003 secs bitstrstr4: 1500000000 : 1500000000 took:4.249651e-003 secs bitstrstr4: 1750000000 : 1750000000 took:4.525668e-003 secs bitstrstr4: 2000000000 : 2000000000 took:4.428185e-003 secs bitstrstr5: 0 : -1 took:2.952949e-002 secs bitstrstr5: 250000000 : 250000000 took:1.516276e-002 secs bitstrstr5: 500000000 : 500000000 took:7.000594e-003 secs bitstrstr5: 750000000 : 750000000 took:6.885374e-003 secs bitstrstr5: 1000000000 : 1000000000 took:6.434349e-003 secs bitstrstr5: 1250000000 : 1250000000 took:6.177053e-003 secs bitstrstr5: 1500000000 : 1500000000 took:6.085954e-003 secs bitstrstr5: 1750000000 : 1750000000 took:8.395146e-003 secs bitstrstr5: 2000000000 : 2000000000 took:7.613071e-003 secs

    The code needs a little clean-up, but I'll post it if anyone is interested?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      Indeed. Horspool, QuickSearch, and the version by anony-me are all subtly different.

      Horspool does not insert the last character/fragment into the shift table. Compare is triggered when the test fragment matches the needle end (the latter is loop invariant, fetch before loop).

      Quick Search does insert the last fragment. The window and the shifts are one larger; the test position one further. But then, each iteration does a full compare unconditionally.

      They both have a loop dependency on the table element. If the table is not densely populated, it can be beneficial to split the loop increment to either side of the test-and-branch. A pos += max_shift on the branch-predicted path has no memory dependency, thus allowing the loop to run speculatively ahead several iterations.

        They both have a loop dependency on the table element. If the table is not densely populated, it can be beneficial to split the loop increment to either side of the test-and-branch. A pos += max_shift on the branch-predicted path has no memory dependency, thus allowing the loop to run speculatively ahead several iterations.

        That sounds interesting. I think I get what you mean by the third sentence; but the emboldened bit of the second sentence could do with a little fleshing out?

        If it helps, here is the 16-bit table code. Perhaps you could indicate what you would modify? No need to test; just suggest :)

        static U64 count = 0; // just for gathering statistics. #define SL( v, o ) __shiftleft128( *((v)+1), *((v)), o ) void scan16( I64 *table, const U64 *pin, U8 opin, U64 lpin ) { U64 i; for( i = 0; i < 65536; ++i ) table[ i ] = lpin - 16 + 1; for( i = 0; i < lpin - 16 +1; ++i ) { U64 q = SL( pin+(i/64), (opin+i)%64 ); U16 idx = (q & 0xffff000000000000) >> 48ull; table[ idx ] = lpin - 16 - i; if( table[ idx ] < 1 ) table[ idx ] = 1; } } __forceinline U8 cmpBits4( const register U64 *hay, const register U64 *pin, U8 regi +ster ohay, U8 register opin, U64 lpin ) { U64 register i; for( i=0; i < lpin/64; ++i, ++hay, ++pin ) { ++count; if( SL( hay, ohay ) != SL( pin, opin ) ) return 0; } { U8 lastbits = lpin % 64; U64 mask = ( ( 1ull << ( lastbits ) ) -1 ) << ( 64 - lastbits +); ++count; if( ( SL( hay, ohay ) & mask ) != ( SL( pin, opin ) & mask ) ) + return 0; } return 1; } U64 bitstrstr4( const register U64 *hay, U64 *pin, U8 register ohay, U +8 register opin, U64 lhay, U64 lpin ) { U64 register i; I64 table[65536]; scan16( table, pin, opin, lpin ); count = 0; for( i = ohay; i < lhay+ohay-lpin; ) { if( cmpBits4( hay+(i/64), pin, i%64, opin, lpin ) ) return i; else { U16 idx = ( SL( hay+((i+lpin-16)/64), (i+lpin-16)%64 ) & 0 +xffff000000000000 ) >> 48; i += table[ idx ]; } } return -1; }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
        In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^8: Why Boyer-Moore, Horspool, alpha-skip et.al don't work for bit strings. (And is there an alternative that does?)
by BrowserUk (Patriarch) on Apr 05, 2015 at 21:30 UTC

    You did say that already didn't you. Sorry for having to be spoon-fed.

    The upshot is an algorithm I understand; seems to work; and I can implement. Thank you.

    I have certain fears that it won't yield huge saving, because with a modest needle size of 1024 (my needles can be several million bits), you get 1024 - 8 + 1 8-bit patterns, thus an average of 4 candidates per pattern. With the maximum shift very unlikely, and using the minimum of the (ave.) four possibles for each pattern, the shifts are likely to be very modest I think.

    Which I guess means, increase the size of the table. 1024 - 16 +1 / 65536 = 0.0153961181640625 per pattern. Should yield better results; at the penalty of a larger table and greater competition for cache.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked