in reply to Re^2: [OT] The interesting problem of comparing bit-strings.
in thread [OT] The interesting problem of comparing bit-strings.
Ahhh.... the bitstrings are very large? How about the typical length of the patterns for which you search within such strings?
If the length of the string were longer than 3 * 64 = 192 bits long, and especially if it were longer, then you would have a definite “crib,” in the form of a 64-bit quantity ... there are exactly 64 possibilities ... that must occur in the pattern at a quadword boundary. This would be a “huge win™,” well worth the extra effort of creating a special-case for, in anticipation that it would actually pay-off most of the time.
I-F the string being sought was at least 192 bits long, then you are able to calculate, with certainty, the 64 possible values that exist for “the 64-bit ‘word in the middle.’” You can now search for each of these in turn, using an extremely efficient REPNE SCASD (i86 ...) assembler-language construct, and then explore the surrounding doublewords in each case to determine if the bit-string occurs. (If the run being sought is “even longer,” then a REPNE CMPSD loop can be applied to check all of the “subsequent doublewords,” the values of all of which once again can be determined with certainty. Only the partial-doublewords at the beginning and at the end, if any, remain to be verified.)
Easily the most efficient implementation would be to loop through all 64 of the pre-computed possibilities, using the assembler instructions aforementioned to locate all occurrences within the buffer. (Of course, and as you of all Monks well know, you will need to use a “sliding window” scan through the compass of the entire file, to avoid missing any bitstrings that serependitiously fell across a boundary, but, I digress.)
Each and every time you stumble-upon any of these 64 values, the odds that you have, in fact, “found another answer” are rather-astronomically high.
In fact, in such a case you might even wish to “take a calculated risk” that in fact you need look no further. If you actually discover a matching 64-bit quantity within the buffer, then the odds are “64 in 2^64” that the entire value-being-sought is present. Certainly, if the REPNE CMPSD succeeds with regard to a location found by REPNE SCASD, then the odds of being wrong become absurdly (and dismissably) small: “It is a match. It has to be.” At this point, the actual odds of committing a “type-2 error” are . . . zero.
No, this is not a solution to your entire problem: if the bit-string being sought turns out to be shorter, it will not help at all. But, for longer strings, it will cause the problem to “virtually disappear.”
| Replies are listed 'Best First'. | |
|---|---|
|
Re^4: [OT] The interesting problem of comparing bit-strings.
by BrowserUk (Patriarch) on Mar 25, 2015 at 00:52 UTC | |
by salva (Canon) on Mar 25, 2015 at 09:47 UTC | |
by BrowserUk (Patriarch) on Mar 25, 2015 at 11:40 UTC | |
by salva (Canon) on Mar 25, 2015 at 11:59 UTC | |
by BrowserUk (Patriarch) on Mar 25, 2015 at 12:23 UTC | |
| |
by salva (Canon) on Mar 25, 2015 at 12:23 UTC | |
by BrowserUk (Patriarch) on Mar 25, 2015 at 12:45 UTC | |
|