in reply to Searching parallel arrays.
The input phase reads the arrays and switches on the N-th bit of a vector for each value N. After a run-length encoding you are looking at every 2nd value (each one is the length of a block of 1's) and keeping those > 3.
Depending on the expected hit density, it can pay to lose the relation to the input array and 'grep' it again.
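A minimal sketch of the idea as described, assuming the values are small non-negative integers (the sample values and array names are only illustrative):

    use strict;
    use warnings;

    # Hypothetical input: the values from the arrays, assumed small enough
    # to index a bit vector directly.
    my @values = ( 3, 4, 5, 6, 7, 20, 21, 40, 41, 42, 43, 44 );

    # Input phase: switch on the N-th bit for every value N.
    my $vec = '';
    vec( $vec, $_, 1 ) = 1 for @values;

    # Expand to a 0/1 string and scan the runs of 1's (the "every 2nd value"
    # of the run-length encoding); report blocks longer than 3 bits.
    my $bits = unpack 'b*', $vec;
    while ( $bits =~ /(1+)/g ) {
        my $len   = length $1;
        my $start = pos($bits) - $len;
        print "block of $len ones starting at bit $start\n" if $len > 3;
    }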
Re^2: Searching parallel arrays.
by BrowserUk (Patriarch) on Dec 08, 2006 at 21:14 UTC
That's a really interesting idea. In this case I'm not sure it's workable as, unlike my example, the integers in question can potentially take the full (unsigned) range of values--they are in fact addresses--so the bit vectors could get pretty large. Even if I scale the values (the addresses will always be at least 4-byte aligned), the range is still potentially 2**29 on a 32-bit CPU, which would make for some pretty sparse bit vectors. RLE might help, but then you lose the advantage of being able to use bitstring logical operations on them.

Later on I might be able to further reduce the range by excluding large chunks at the bottom (and maybe top) of memory, and I might be able to rely on 12- or even 16-byte alignment. That would hugely reduce the size of the bit vectors and could bring this into the realms of feasibility. Thanks.

If anyone knows how to obtain the lower and upper bounds of the dynamic memory of the current Perl process, I'd love to hear from them.
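For a rough feel for the sizes involved, here is a back-of-envelope sketch; the 2GB user-mode address space is an assumption made only to illustrate the arithmetic (it is one way to arrive at the 2**29 figure above), and the alignments are the ones mentioned:

    use strict;
    use warnings;

    # Back-of-envelope sizing: one bit per possible (aligned) address.
    # The 2GB user-mode address space is an assumption for illustration.
    my $address_space = 2**31;

    for my $alignment ( 4, 16 ) {
        my $bits  = $address_space / $alignment;
        my $bytes = $bits / 8;
        printf "%2d-byte alignment: %d bits => %d MB bit vector\n",
            $alignment, $bits, $bytes / 2**20;
    }

    # Setting a bit for an address then just scales the address down first, e.g.
    #   vec( $vec, $address >> 4, 1 ) = 1;   # assuming 16-byte alignment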
Re^3: Searching parallel arrays.
by NiJo (Friar) on Dec 09, 2006 at 19:35 UTC
http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
You need to loop through the arrays once to create the vector and a second time for the test [N-2..N+2].
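A sketch of that two-pass approach, with a plain bit vector standing in for the Bloom filter from the linked article (a real Bloom filter would hash each value several times rather than indexing it directly); the arrays here are made up:

    use strict;
    use warnings;

    # Hypothetical inputs.
    my @a = ( 100, 205, 317, 542 );
    my @b = ( 99, 203, 400, 543 );

    # Pass 1: create the vector from the first array.
    my $vec = '';
    vec( $vec, $_, 1 ) = 1 for @a;

    # Pass 2: for each value N in the second array, test the window N-2 .. N+2.
    for my $n (@b) {
        my @hits = grep { vec( $vec, $_, 1 ) } $n - 2 .. $n + 2;
        print "$n is within 2 of a value in \@a (@hits)\n" if @hits;
    }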
Re^4: Searching parallel arrays.
by BrowserUk (Patriarch) on Dec 10, 2006 at 02:33 UTC
Yeah. I first encountered and attempted to use Bloom filters back in the late 80's. A great mentor of mine at that time once described Bloom filters as "a partial solution looking for a partial problem to partially solve", and on the basis of my experience of trying to use them on that occasion and two occasions since then, I have no reason to argue with that.

They sound great on paper, and many people advocate them on the basis of the configurable low "false positives" rate and low memory consumption, without ever really understanding the implications. Obviously, for some applications they are an economic and efficient solution, if you get the hashing functions correct--which I found non-trivial, but then I probably do not understand the math. But the bit people miss is that you still need to have a real data lookup mechanism, and the real data, available. Once you get your Bloom filter to operate, it's this secondary mechanism that becomes the time/memory-consuming part of the process, and inevitably this is where you have to concentrate your tuning efforts.

On the few occasions I've considered using BFs, once we had tuned the secondary mechanism, we discovered that it was fast enough, or had a small enough footprint, to stand on its own merits without the complication of finding 3, 4 or more suitable and compatible hashing algorithms for the BF. Maybe my lack of a full understanding of the math made the hashing harder than it should be, and had I used a 'canned' BF solution like one of those on CPAN I might have had different experiences. Or maybe the applications to which I tried to apply them simply were not suitable for this solution.

Either way, I think that BFs are often advocated inappropriately, and many times when they are used, better solutions are possible. Often, BFs are advocated for hashing applications that require large numbers of large keys, and they are written to hash entire key fields. But sometimes a little analysis can isolate smaller subsets of those keys, which means the indexes can be smaller, fit in memory, and avoid the expensive secondary storage lookup. Sometimes a perfect hashing algorithm can be found that achieves the same end.

Maybe they would work for this application, I haven't really thought that through, but thrice bitten, fo..er. fri... er..fourth time shy :)
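To make the trade-off concrete, here is a minimal Bloom-filter membership sketch (the size, keys and hashing scheme are only illustrative, not tuned); note that every filter hit still has to be confirmed against the real data, which is the secondary mechanism discussed above:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5 );

    my $m = 2**20;    # filter size in bits (illustrative)
    my $k = 4;        # number of hash functions

    # Derive $k bit positions from one MD5 digest -- a cheap stand-in for
    # $k independent hash functions.
    sub positions {
        my ($key) = @_;
        my @words = unpack 'N4', md5($key);
        return map { $_ % $m } @words[ 0 .. $k - 1 ];
    }

    # Build the filter from the real keys.
    my %real_data = map { $_ => 1 } qw( alpha beta gamma );
    my $filter    = '';
    vec( $filter, $_, 1 ) = 1 for map { positions($_) } keys %real_data;

    # A probe is "possibly present" only if all $k bits are set; even then it
    # may be a false positive, so the real-data lookup is still required.
    for my $probe (qw( alpha delta )) {
        my $hits = grep { vec( $filter, $_, 1 ) } positions($probe);
        if ( $hits == $k ) {
            print "$probe: filter hit, ",
                $real_data{$probe} ? "confirmed by real data\n" : "false positive\n";
        }
        else {
            print "$probe: definitely not present\n";
        }
    }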