in thread [OT] The interesting problem of comparing bit-strings.

For flat arrays, inserting an element into a sorted array of N elements takes two steps: 1) find the insertion point, which is O(N) with a linear scan, and 2) insert the new value there, which is also O(N) because every element after the insertion point has to be moved up one place. Overall, the operation is O(N).
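To make that concrete, here is a minimal C sketch (the code and names are mine, for illustration only) of the two steps:

    #include <string.h>

    /* Insert 'val' into a sorted array 'a' of 'n' ints (capacity >= n+1).
       Step 1, the linear scan, is O(N); step 2, the memmove that shifts
       the tail up one slot, is also O(N). Returns the new element count. */
    size_t sorted_insert(int *a, size_t n, int val)
    {
        size_t i = 0;
        while (i < n && a[i] < val)                      /* step 1: O(N) */
            ++i;
        memmove(&a[i + 1], &a[i], (n - i) * sizeof *a);  /* step 2: O(N) */
        a[i] = val;
        return n + 1;
    }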

With lists, step 2 is cheaper (the splice itself is O(1)), but you still have to find the insertion point, which is O(N), so overall the operation remains O(N).
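Again as a rough sketch of my own (not code from the thread), the list version; the splice is constant time, but the walk to the insertion point keeps the whole operation O(N):

    #include <stdlib.h>

    struct node { int val; struct node *next; };

    /* Insert 'val' into the sorted singly linked list at *head.
       Returns 0 on allocation failure, 1 on success. */
    int list_insert(struct node **head, int val)
    {
        struct node *n = malloc(sizeof *n);
        if (!n) return 0;
        n->val = val;

        while (*head && (*head)->val < val)   /* finding the spot: O(N) */
            head = &(*head)->next;

        n->next = *head;                      /* the splice itself: O(1) */
        *head = n;
        return 1;
    }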

Lists are cache-unfriendly for two reasons: 1) they use more memory than arrays (at least 2x for built-in types, and commonly 4x or 8x), and 2) their nodes may be scattered in memory, rendering cache prefetching useless. So it is easy to get into a situation where advancing to the next element always means going out to L3 or even to RAM.
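As a rough illustration of the space overhead (this assumes a typical 64-bit platform; the exact numbers vary with type, alignment and allocator):

    #include <stdio.h>

    struct node { int val; struct node *next; };

    int main(void)
    {
        /* Commonly 4 bytes vs 16 bytes (4 data + 8 pointer + 4 padding),
           i.e. 4x per element, before malloc's own per-allocation
           bookkeeping is even counted. */
        printf("array element: %zu bytes\n", sizeof(int));
        printf("list node:     %zu bytes\n", sizeof(struct node));
        return 0;
    }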

In contrast, navigating an array, even when it doesn't fit into L2, is much faster because the cache prefetching is fully effective.

BTW, you get O(log N) insertions when you use a (balanced) tree.
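For completeness, a bare-bones (unbalanced) BST insert sketch in C, purely illustrative and mine; a real implementation needs rebalancing (AVL, red-black, ...) to guarantee O(log N) in the worst case:

    #include <stdlib.h>

    struct tnode { int val; struct tnode *left, *right; };

    /* Walk down the tree, discarding half of it (if balanced) at each
       comparison, then hang the new node off the empty link we reach. */
    int tree_insert(struct tnode **root, int val)
    {
        while (*root)
            root = (val < (*root)->val) ? &(*root)->left : &(*root)->right;

        *root = malloc(sizeof **root);
        if (!*root) return 0;
        (*root)->val  = val;
        (*root)->left = (*root)->right = NULL;
        return 1;
    }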


Re^8: [OT] The interesting problem of comparing (long) bit-strings.
by BrowserUk (Patriarch) on Mar 31, 2015 at 11:03 UTC

    I take it that Boyer Moore was a bust then?


      How is your data?

      If it is mostly random, without repeated patterns (for instance, most bits being 0), and the needles are long, B-M can potentially be several orders of magnitude faster than the brute-force approach.

      In the bad-data cases, B-M would just degrade to the equivalent of the brute-force algorithm; I don't think it would introduce much overhead.
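      For reference, a byte-level Boyer-Moore-Horspool sketch (my own illustration, not code from this thread): the skip table is what lets it jump up to a whole needle-length past a mismatch, and on bad data the skips shrink towards the brute-force cost.

          #include <stddef.h>
          #include <string.h>

          /* Return the byte offset of the first occurrence of ndl in hay,
             or (size_t)-1 if it is not found. */
          size_t bmh_search(const unsigned char *hay, size_t hlen,
                            const unsigned char *ndl, size_t nlen)
          {
              size_t skip[256], i, pos;

              if (nlen == 0 || nlen > hlen) return (size_t)-1;

              for (i = 0; i < 256; i++)        /* default: skip a whole needle */
                  skip[i] = nlen;
              for (i = 0; i + 1 < nlen; i++)   /* last occurrence of each byte */
                  skip[ndl[i]] = nlen - 1 - i;

              for (pos = 0; pos + nlen <= hlen; pos += skip[hay[pos + nlen - 1]])
                  if (memcmp(hay + pos, ndl, nlen) == 0)
                      return pos;

              return (size_t)-1;
          }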

        How is your data?

        There's no way to say, as it is intended for general use. Some bitsets may be sparse; some very dense.

        The current use-case is characterised by substantial chunks of zero data interspersed with longish runs of fairly dense stuff.

        B-M can potentially be several orders of magnitude faster than the brute-force approach

        You keep claiming that, but for all the reasons I've outlined elsewhere, I do not believe it.

        I'd be happy to be proved wrong, because in the end, I just want the fastest search I can code, but I can't even see how you would adapt B-M to bit-string search.

        I'm doing my comparisons in 64-bit chunks; but building delta tables with 64-bit indices is obviously not on.

        So you look at doing byte-sized compares in order to keep the table sizes reasonable.

        BUT:

        1. Doing 8 byte-byte compares instead of a single quad-quad compare costs way more than 8 times as much.

          Not only does it require 8 cmp instructions instead of 1, it also requires 8 counter increments and 8 jumps.

          Even if the compiler unrolled the loop -- which it doesn't -- or I coded it in assembler, which I won't, it would still take substantially more than 8 times longer, because loading a 64-bit register with 8-bit units means the microcode has to shuffle 7 of the 8 bytes into the low 8 bits of the register. And it has to do that for both comparands.

          So the 8 x 8-bit compares end up more than 8 times slower than the single 64-bit compare.

          But don't forget that for each n bits, you need to do n comparisons, with one of the comparands shifted by 1 bit each time (see the sketch after this list).

          So now the delta between 1 x 64-bit comparison and 8 x (unaligned) 8-bit comparisons becomes 64 x 64-bit comparisons versus 64 x 8-bit comparisons.

          And that's not to mention that the 8-bit values from which bits need to be shifted in will also need to be shuffled by the microcode, adding further overheads.

        2. Instead of 2 tables (4 * needle length in bytes each), you'd need 16 tables (2 for each of the 8 possible bit offsets) in order to deal with the bit-aligned nature of the needle.

          For a modest-size 8192-bit needle, you're looking at 16 * 4 * 1024 = 64k of table space that needs to be randomly accessed, wiping out my 32k L1 cache in the process.
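        To put point 1 in code: a rough sketch of the shifted 64-bit window comparison (my reading of the approach, with invented names, assuming LSB-first bit order within 64-bit chunks); testing a single 64-bit chunk of the needle at every possible bit offset costs 64 shifted 64-bit comparisons.

            #include <stdint.h>

            /* Extract 64 bits from 'hay' starting at absolute bit offset
               'bitpos'. Assumes at least one extra valid qword past the window. */
            static uint64_t window64(const uint64_t *hay, uint64_t bitpos)
            {
                uint64_t q   = bitpos >> 6;      /* which 64-bit chunk           */
                unsigned off = bitpos & 63;      /* bit offset within that chunk */
                if (off == 0)
                    return hay[q];
                return (hay[q] >> off) | (hay[q + 1] << (64 - off));
            }

            /* Do the first 64 bits of the needle ('needle0') match at any of
               the 64 bit offsets starting at chunk 'q'?  One shifted 64-bit
               compare per offset. */
            static int match_any_offset(const uint64_t *hay, uint64_t q,
                                        uint64_t needle0)
            {
                for (unsigned off = 0; off < 64; off++)
                    if (window64(hay, (q << 6) + off) == needle0)
                        return 1;
                return 0;
            }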

        I don't know for sure, because I haven't tried it, because I don't believe it would be beneficial. I'd be happy to be proved wrong, but I don't think I will be.

