in reply to Re: [OT] The interesting problem of comparing (long) bit-strings.
in thread [OT] The interesting problem of comparing bit-strings.

Alright. Here's a demo of alpha skip search on bitstrings. The needle size is fixed, but it should be trivial to follow this with a longer compare. A possibility exists that some off-by-one errors have crept in. If that's the case, just deduct from my paycheck. :-)

Alpha skip is quadratic in the worst case and probably not the best choice when nasty matches are expected. Anyway, I'm curious how this might hold up against the optimized brute-force routines.

// Feed this with random needle and haystack: // dd if=/dev/urandom count=1 bs=77777 | ./a.out // // Fixed size needle. No optimizations attempted. #include <stdio.h> #include <stdint.h> #include <string.h> #include <unistd.h> enum { M = 8192, logM = 13, Mmask = M - 1, Mskip = M - logM + 1, Mnil }; typedef uint32_t U32; typedef uint16_t U16; typedef struct { U16 head[M], next[M]; // index chains } skiptab; // extract logM bits at bit offset i. assert(logM <= 16) static inline size_t fbits(char *v, size_t i) { U32 *x = (U32 *)&v[2 * (i >> 4)]; return Mmask & (*x >> (i & 15)); } // 'tis rotten. should make shifted copies of x and memcmp the middle. static size_t xcmp(char *y, size_t j, char *x) { size_t i, diff = fbits(y, j) - fbits(x, 0); for (i = M % logM; i < M && !diff; i += logM) { diff = fbits(y, i + j) - fbits(x, i); } return diff; } static void prep(char *x, skiptab *Z) { size_t i, f; for (i = 0; i < M; i++) { Z->head[i] = Mnil; } for (i = 0; i < Mskip; i++) { f = fbits(x, i); Z->next[i] = Z->head[f]; Z->head[f] = i; } } // y of length n is haystack; x is the needle size_t search(char *y, size_t n, char *x, size_t m) { skiptab Z[1]; size_t i, j; // assert(m == M); prep(x, Z); n -= logM; for (i = -1; (i += Mskip) <= n;) { j = fbits(y, i); for (j = Z->head[j]; j != Mnil; j = Z->next[j]) { if (!xcmp(y, i-j, x)) return i-j; } } return -1; } // one-bit rotate void rot1(char *v, size_t n) { size_t i, c; for (c = i = 0; i < n; c >>= 8, i++) { v[i] = c += *(unsigned char *)(v+i) << 1; } v[0] += c; } int main(int argc, char *argv[]) { static char x[M/8], y[10*M]; size_t r, i, n, m = sizeof x; if (read(0, x, sizeof x) != m) return printf("read x"), 1; n = read(0, y, sizeof y); if (n < m + 10) return printf("read y"), 1; r = y[n -= 1] & 7; i = *(U32*)&y[n -= 4]; i = i % (n - m - 1); printf("rot %zu at %zu; bitoff=%zu\n", r, i, r + i*8); memcpy(y + i, x, m); // stuff the needle for (; r; r--) rot1(y + i, m + 1); // and rotate printf("search=%ld\n", (long)search(y, n*8, x, m*8)); return 0; }

Note: to change the needle size, just edit the M and logM values. (Compile with -fno-strict-aliasing, or fix the type-punning).

  • Comment on Re^2: [OT] The interesting problem of comparing (long) bit-strings.
  • Download Code

Replies are listed 'Best First'.
Re^3: [OT] The interesting problem of comparing (long) bit-strings. (Doesn't work!)
by BrowserUk (Patriarch) on Mar 29, 2015 at 16:29 UTC

    Making minimal changes to your code to allow it compile here, and it simply never finds anything(very quickly), no matter how simple I make the search criteria:

    C:\test\C>cl /W3 bitsearch-oiskuu.c Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64 Copyright (C) Microsoft Corporation. All rights reserved. bitsearch-oiskuu.c bitsearch-oiskuu.c(49) : warning C4267: '=' : conversion from 'size_t' + to 'U16', possible loss of data bitsearch-oiskuu.c(78) : warning C4267: '=' : conversion from 'size_t' + to 'char', possible loss of data bitsearch-oiskuu.c(80) : warning C4267: '+=' : conversion from 'size_t +' to 'char', possible loss of data Microsoft (R) Incremental Linker Version 9.00.21022.08 Copyright (C) Microsoft Corporation. All rights reserved. /out:bitsearch-oiskuu.exe bitsearch-oiskuu.obj C:\test\C>bitsearch-oiskuu Result -1 Took 1.837746e-004 secs

    I've wasted enough time on it; and won't be expending any more, but here is the modified code that compiles clean (apart from warnings that originate from your posted code):

    #include <stdio.h> #include <stdlib.h> //#include <stdint.h> #include <string.h> //#include <unistd.h> //typedef uint32_t U32; //typedef uint16_t U16; typedef unsigned __int64 U64; typedef unsigned __int32 U32; typedef unsigned __int16 U16; typedef unsigned __int8 U8; enum { M = 8192, logM = 13, Mmask = M - 1, Mskip = M - logM + 1, Mnil +}; typedef struct { U16 head[M], next[M]; // index chains } skiptab; // extract logM bits at bit offset i. assert(logM <= 16) static __inline size_t fbits(char *v, size_t i) { U32 *x = (U32 *)&v[2 * (i >> 4)]; return Mmask & (*x >> (i & 15)); } // 'tis rotten. should make shifted copies of x and memcmp the middle. static size_t xcmp(char *y, size_t j, char *x) { size_t i, diff = fbits(y, j) - fbits(x, 0); for (i = M % logM; i < M && !diff; i += logM) { diff = fbits(y, i + j) - fbits(x, i); } return diff; } static void prep(char *x, skiptab *Z) { size_t i, f; for (i = 0; i < M; i++) { Z->head[i] = Mnil; } for (i = 0; i < Mskip; i++) { f = fbits(x, i); Z->next[i] = Z->head[f]; Z->head[f] = i; } } // y of length n is haystack; x is the needle size_t search(char *y, size_t n, char *x, size_t m) { skiptab Z[1]; size_t i, j; // assert(m == M); prep(x, Z); n -= logM; for (i = -1; (i += Mskip) <= n;) { j = fbits(y, i); for (j = Z->head[j]; j != Mnil; j = Z->next[j]) { if (!xcmp(y, i-j, x)) return i-j; } } return -1; } // one-bit rotate void rot1(char *v, size_t n) { size_t i, c; for (c = i = 0; i < n; c >>= 8, i++) { v[i] = c += *(unsigned char *)(v+i) << 1; } v[0] += c; } int main( int argc, char **argv ) { U64 hay[1] = { 0xdeadbeefdeadbeef }; U64 ndl[1] = { 0xdeadbeefdeadbeef }; U64 start, end, found = 0; start = __rdtsc(); found = search( (char*)hay, sizeof(hay)*8, (char*)ndl, sizeof(ndl) +*8 ); _ReadBarrier(); end = __rdtsc(); printf( "Result %12d\t", found ); printf( "Took %9e secs\n", (double)( end - start ) / 2394046359.0 +); return 0; }

    I'd suggest you enter it in an obfuscated C competition; could be a winner.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^3: [OT] The interesting problem of comparing (long) bit-strings.
by BrowserUk (Patriarch) on Mar 29, 2015 at 09:38 UTC
    'm curious how this might hold up against the optimized brute-force routines.

    It's 3 orders of magnitude slower.

      More or less, yeah.

      [ 0.033410] needle 512, haystack 134217216 bytes [ 0.000030] rot 5 at 8000007; bitoff=64000061 [ 0.000017] needle stuffed! [ 0.000644] search=64000061
      I see your table lists "2.483719e-001" for the corresponding case. Brute force appears to be >300 times slower (0.2483719/0.000644).

      Looks like { M = 16384, logM = 14 } case is the fastestfaster, with tables weighing in at 64KB. Beyond that, slowdown. (0.2784286/0.000236 == ~1180). Even bigger needles/tables may be quicker still.