in reply to [OT] The interesting problem of comparing bit-strings.
You'll likely want a search that is strongly sub-linear, i.e. better than O(n). There are string search algorithms that approach O(log(m)*n/m) on average. With long needles (length m), that'll be a blast. However, you don't get anything for free: preprocessing requirements are O(m) in time and space.
Google for "Turbo Reverse Factor" and "Alpha Skip Search". Wikipedia seems to lack pages for these. With alpha skip, you create a trie (or a table) where you push the offsets of all substrings of length log(m). Bitstrings are just strings on size=2 alphabet. Which algorithm or variant thereof will prove most suitable for you is something hardly anyone here could guess.
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by BrowserUk (Patriarch) on Mar 28, 2015 at 10:40 UTC
"You'll likely want a search that is strongly sub-linear, i.e. better than O(n). There are string search algorithms that approach O(log(m)*n/m) on average."

On modern hardware, the single most expensive avoidable* 'operation' is a cache miss, with instruction-cache misses more expensive than data-cache misses because of pipelining. (*Context switches are more expensive, but are, at least in the long term, unavoidable.)

The problem with (all?) the analyses of the various sub-linear algorithms is that they ignore (or predate) the costs and benefits of the two and three levels of on-chip instruction and data cache that have become a significant factor in implementing this type of algorithm. My 7-year-old CPU, with only 2-level caching, can perform upwards of 1000 instructions in the time it takes to fulfil a data-cache miss; the effect of an instruction-cache miss is even greater. Sub-linear algorithms exacerbate both problems, in 3 different ways:

The beauty of well-written brute-force algorithms is that they derive the absolute maximum benefit from the cache effect by doing as much work as possible on each cache line before allowing it to get flushed, thus minimising the need for reloads.

In the following table of benchmark timings from my implementation:

Summary: the algorithm is very linear in terms of the distance into the buffer, and sub-linear in terms of the length of the needle. And it is very fast.
Can it be beaten? Maybe, but I very much doubt it will be done using Boyer-Moore or similar jump-around-the-data algorithms. Time will tell :)
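As an aside, a minimal sketch of the kind of cache-friendly, strictly sequential brute-force scan being argued for might look as follows. This is not the benchmarked implementation referred to above (that code isn't reproduced here); the function name, the MSB-first bit packing, and the restriction to needles of at most 64 bits are assumptions made purely for brevity.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Streaming brute-force bit-string search (sketch; needle is 1..64 bits,
 * haystack packed MSB-first). Every haystack byte is loaded exactly once,
 * so memory access is purely sequential; the sliding window lives in a
 * register. Returns the bit offset of the first match, or -1.
 */
long find_bits(const uint8_t *hay, size_t hay_bits,
               uint64_t needle, unsigned m)          /* needle right-aligned */
{
    const uint64_t mask = (m == 64) ? ~0ull : ((1ull << m) - 1);
    uint64_t window = 0;
    size_t seen = 0;

    for (size_t byte = 0; byte * 8 < hay_bits; byte++) {
        uint8_t b = hay[byte];                       /* one sequential load  */
        size_t left = hay_bits - byte * 8;
        unsigned n = left < 8 ? (unsigned)left : 8;
        for (unsigned i = 0; i < n; i++) {
            window = (window << 1) | ((b >> (7 - i)) & 1);   /* shift next bit in */
            if (++seen >= m && (window & mask) == needle)
                return (long)(seen - m);             /* match starts here    */
        }
    }
    return -1;
}
```

Longer needles would only add a verification step after a 64-bit prefix hit; the strictly sequential access pattern is the point.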
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by oiskuu (Hermit) on Mar 29, 2015 at 09:10 UTC
Alright. Here's a demo of alpha skip search on bitstrings. The needle size is fixed, but it should be trivial to follow this with a longer compare. A possibility exists that some off-by-one errors have crept in. If that's the case, just deduct from my paycheck. :-) Alpha skip is quadratic in the worst case and probably not the best choice when nasty matches are expected. Anyway, I'm curious how this might hold up against the optimized brute-force routines.
Note: to change the needle size, just edit the M and logM values. (Compile with -fno-strict-aliasing, or fix the type-punning).
by BrowserUk (Patriarch) on Mar 29, 2015 at 16:29 UTC
Making minimal changes to your code to allow it to compile here, it simply never finds anything (very quickly), no matter how simple I make the search criteria:
I've wasted enough time on it, and won't be expending any more, but here is the modified code that compiles clean (apart from warnings that originate from your posted code):
I'd suggest you enter it in an obfuscated C competition; it could be a winner.
by BrowserUk (Patriarch) on Mar 29, 2015 at 09:38 UTC
"I'm curious how this might hold up against the optimized brute-force routines." It's 3 orders of magnitude slower.
by oiskuu (Hermit) on Mar 29, 2015 at 10:32 UTC
More or less, yeah. I see your table lists "2.483719e-001" for the corresponding case. Brute force appears to be >300 times slower (0.2483719/0.000644).
Looks like { M = 16384, logM = 14 } case is
Re^2: [OT] The interesting problem of ... bit-strings (alpha skip search)
by oiskuu (Hermit) on Apr 01, 2015 at 10:18 UTC
I'll attempt a brief description of the Alpha Skip Search. Hopefully this may shed some light on the basic operation of the algorithm.

Consider a pattern of length M. The approach is to quickly locate any fractional substring within the pattern, wherever it may occur. Substrings of length F take offsets from 0 .. M-F, inclusive. Thus there is a set of M-F+1 needle fragments. The preprocessing stage creates an index of those fragments. A Perl implementation might read as follows:

    push @{$index{ substr($needle, $_, $F) }}, $_ for 0 .. $M-$F;

The search phase proceeds by inspecting the haystack one window at a time. The window shift is the same figure as above, M-F+1. Each time, a substring is extracted at the end of the window. The index then yields all possible alignments at which this fragment matches in the pattern.

The theoretical optimum for F is log_s(M), the log taken to base s, the alphabet size. That amounts to one entry per index bucket, on average. For random data there will be one alignment to test per window, and it is likely to fail at the very start.

In the case of bit-string matching, the substrings are nothing but small integers below (1<<F), which suggests a very simple array-based indexing scheme (instead of a trie). Here's a sample implementation in C (just a rehash of the previous demo, the main difference being that this one handles variable-length needles).
Tested with gcc/clang under Linux; my apologies if the code is too obscure, or otherwise unworkable.

Update: simplified the above code. Note that it is sometimes difficult to put a name on a string search routine, especially with bit-strings; a minor modification can turn one search into another, and the variants abound. Here's a version that might be called Horspool.
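Neither of the C listings referred to above is reproduced in this extract. Purely as an independent illustration of the search phase described earlier (not the author's posted code), a rough sketch might look like this, assuming MSB-first bit packing and a flat array of 1<<F buckets in place of a trie; all names are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Extract F bits starting at bit position 'pos' (MSB-first packing, small F). */
static unsigned get_frag(const uint8_t *s, size_t pos, unsigned F)
{
    unsigned v = 0;
    for (unsigned i = 0; i < F; i++)
        v = (v << 1) | ((s[(pos + i) >> 3] >> (7 - ((pos + i) & 7))) & 1);
    return v;
}

/* Bit-by-bit verification of an M-bit needle at haystack bit offset 'at'. */
static int match_at(const uint8_t *hay, size_t at, const uint8_t *ndl, size_t M)
{
    for (size_t i = 0; i < M; i++) {
        int hb = (hay[(at + i) >> 3] >> (7 - ((at + i) & 7))) & 1;
        int nb = (ndl[i >> 3] >> (7 - (i & 7))) & 1;
        if (hb != nb) return 0;
    }
    return 1;
}

/*
 * Alpha skip over bit-strings: index every F-bit fragment of the M-bit needle,
 * slide a window of M-F+1 bits across the N-bit haystack, look up the fragment
 * at the end of each window, and verify each recorded alignment.
 * Returns the first match offset in bits, or -1.
 */
long alpha_skip(const uint8_t *hay, size_t N, const uint8_t *ndl, size_t M, unsigned F)
{
    if (F == 0 || F > M || M > N) return -1;

    size_t shift = M - F + 1;                        /* window advance            */
    long *head = malloc(sizeof *head << F);          /* fragment -> latest offset */
    long *next = malloc(sizeof *next * shift);       /* offset   -> older offset  */
    if (!head || !next) { free(head); free(next); return -1; }

    for (size_t v = 0; v < (1u << F); v++) head[v] = -1;
    for (size_t off = 0; off + F <= M; off++) {      /* preprocessing: O(M)       */
        unsigned frag = get_frag(ndl, off, F);
        next[off] = head[frag];
        head[frag] = (long)off;
    }

    long found = -1;
    for (size_t win = 0; win + M <= N && found < 0; win += shift) {
        unsigned frag = get_frag(hay, win + shift - 1, F);   /* fragment ending the window */
        for (long off = head[frag]; off >= 0; off = next[off]) {
            size_t cand = win + shift - 1 - (size_t)off;     /* implied alignment          */
            if (cand + M <= N && match_at(hay, cand, ndl, M)) { found = (long)cand; break; }
        }
    }
    free(head); free(next);
    return found;
}
```

With F chosen near log2(M), most windows yield a single candidate alignment that fails after a bit or two, which is where the sub-linear average-case behaviour comes from.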
by BrowserUk (Patriarch) on Apr 01, 2015 at 11:47 UTC
Despite saying I wouldn't look at this further, since you updated, I did. But I didn't get very far. Once I commented out a couple of (non-existent on my system) unix-y headers -- no idea if they are necessary or not:
I got this:
Which I fixed by changing 1L to 1ull:
Which got me to:
From these 3 lines:
I hope you'll understand that I'm not being unfair when I say I have simply no clue how to fix that... but I did try. (Quite why you need to invent your own special integer types all over the place is beyond my understanding.)
Re^2: [OT] The interesting problem of comparing (long) bit-strings.
by Anonymous Monk on Mar 28, 2015 at 08:46 UTC
by salva (Canon) on Mar 28, 2015 at 08:50 UTC
by Anonymous Monk on Mar 28, 2015 at 20:03 UTC
Hm? Boyer-Moore prep is the same, O(m) in time and space. A good-shift table of length m is used. It might be more compact in practice. If the problem is too large, divide it! Code a tuned search routine for fixed-length (sub)needles, such that all the tables fit into L2. My guess is that KB-sized needles can be done. Now the search becomes two-level: the top-level "alphabet" is the set of subneedles.

As for brute force: the "Boyer-Moore beating assembler versions that utilize microcoded string search instructions" is a decades-old lesson. I tried to google for some edutaining links, but sadly, google has become ****.
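For what it's worth, the "O(m) preprocessing into a skip table" point is easiest to see in the simpler bad-character (Horspool) variant over plain bytes. This is not the good-suffix table mentioned above, and byte-level code sidesteps the bit-alignment problem entirely, so treat it only as a sketch of the idea.

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool over bytes: O(m) preprocessing into a 256-entry skip
 * table; each failed window jumps by the skip value of its last byte.
 * Returns the byte offset of the first match, or -1. */
long horspool(const unsigned char *hay, size_t n, const unsigned char *ndl, size_t m)
{
    size_t skip[256];
    if (m == 0 || n < m) return -1;

    for (size_t c = 0; c < 256; c++) skip[c] = m;            /* default: skip whole needle */
    for (size_t i = 0; i + 1 < m; i++) skip[ndl[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n; pos += skip[hay[pos + m - 1]])
        if (memcmp(hay + pos, ndl, m) == 0) return (long)pos;
    return -1;
}
```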
by BrowserUk (Patriarch) on Mar 29, 2015 at 04:17 UTC
"As for brute force: the 'Boyer-Moore beating assembler versions that utilize microcoded string search instructions' is a decades-old lesson."

For string search, decades ago (before CPUs acquired high-speed, on-chip caches) Boyer-Moore was the business; but since the 486, circa 1990, on-chip caches have tipped the balance again. "Decades-old lessons" were relevant, decades ago.

If you won't take my word for it, and aren't equipped to try it for yourself, take Bjarne Stroustrup's word for it. View http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style and skip ahead to timecode 44:40. A few seconds in, he presents a question (on a slide titled "Vector vs. Linked List") that goes like this. The task comes in two parts:

And the question is: for which N does it become faster to use a linked list than a flat array?

The decades-old wisdom on this is that insertions/deletions into/from an array are O(N²), but for a linked list they are O(N log N) (with slightly higher constant factors), so N only need grow to a couple of hundred or a thousand or so for the linked list to start outperforming the flat array. Right? Wrong!

Watch for a few more seconds (and wonder in amazement at his non-existent graph) as he shows that the answer is "never". At no point does the linked list become faster than the flat array. In fact, the bigger N gets, the bigger the gap between the two becomes, in favour of the flat array. As counter-intuitive as it may be, the benefits of caching on the sequential accesses made by the flat array far outweigh the costs of moving more data. The non-sequential, scattered-around-memory nature of linked lists means they never catch up with, let alone surpass, the performance of the cache-friendly contiguous memory accesses of the flat array.

Times change, and a lot of the "knowledge" CS students acquired, despite being tortured within an inch of their lives by Knuth, is simply out of date. Whilst the proofs of correctness still hold, the analyses of operational complexity -- done in terms of the imaginative, but imaginary, MIX (later MMIX) -- simply do not consider the effects of caching, pipelining, branch prediction, memory stalls et al. Stroustrup concludes that there are "3 orders of magnitude performance" to be gained.

And the relevance of string-search analysis to bit-string search has, to my knowledge, never been demonstrated. They are quite different animals.
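Stroustrup's benchmark itself isn't reproduced here, but a crude sketch of the insertion half of the task shows the shape of the experiment; the element count, the names, and the use of a preallocated node pool (which, if anything, flatters the list) are choices made for this example, not his.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Insert n random ints into a sorted flat array: linear scan to find the slot,
 * then shift the tail up. Both the scan and the shift walk memory sequentially. */
static double array_fill(int *a, int n)
{
    clock_t t0 = clock();
    int len = 0;
    for (int k = 0; k < n; k++) {
        int v = rand();
        int i = 0;
        while (i < len && a[i] < v) i++;
        for (int j = len; j > i; j--) a[j] = a[j - 1];
        a[i] = v;
        len++;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

struct node { int v; struct node *next; };

/* Same task with a singly linked list: the linear scan now chases pointers,
 * so every step is a potential cache miss once the list no longer fits in cache. */
static double list_fill(struct node *pool, int n)
{
    clock_t t0 = clock();
    struct node head = { 0, NULL };
    for (int k = 0; k < n; k++) {
        int v = rand();
        struct node *p = &head;
        while (p->next && p->next->v < v) p = p->next;
        pool[k].v = v;
        pool[k].next = p->next;
        p->next = &pool[k];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    enum { N = 50000 };                 /* arbitrary size; scale to taste */
    int *a = malloc(N * sizeof *a);
    struct node *pool = malloc(N * sizeof *pool);
    if (!a || !pool) return 1;

    srand(42); printf("flat array : %.3f s\n", array_fill(a, N));
    srand(42); printf("linked list: %.3f s\n", list_fill(pool, N));

    free(a); free(pool);
    return 0;
}
```

The flat array wins despite shifting elements on every insert because both its scan and its shift touch memory sequentially, while the list's scan hops between nodes in whatever order they happen to be linked.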
by salva (Canon) on Mar 31, 2015 at 06:24 UTC
by BrowserUk (Patriarch) on Mar 31, 2015 at 09:42 UTC