in reply to Searching parallel arrays.
The input phase reads the arrays and switches on the N-th bit of a vector for each value N. After a run-length encoding you are looking at every 2nd value (each one is the length of a block of 1's) and keeping those > 3.
Depending on the expected hit density, it can pay to lose the relation to the input array and 'grep' it again.
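A minimal sketch of the idea as described, assuming the values are small non-negative integers (the sample values and array names are only illustrative):

    use strict;
    use warnings;

    # Hypothetical input: the values from the arrays, assumed small enough
    # to index a bit vector directly.
    my @values = ( 3, 4, 5, 6, 7, 20, 21, 40, 41, 42, 43, 44 );

    # Input phase: switch on the N-th bit for every value N.
    my $vec = '';
    vec( $vec, $_, 1 ) = 1 for @values;

    # Expand to a 0/1 string and scan the runs of 1's (the "every 2nd value"
    # of the run-length encoding); report blocks longer than 3 bits.
    my $bits = unpack 'b*', $vec;
    while ( $bits =~ /(1+)/g ) {
        my $len   = length $1;
        my $start = pos($bits) - $len;
        print "block of $len ones starting at bit $start\n" if $len > 3;
    }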
Re^2: Searching parallel arrays.
by BrowserUk (Patriarch) on Dec 08, 2006 at 21:14 UTC
That's a really interesting idea. In this case I'm not sure it's workable as, unlike my example, the integers in question can potentially take the full (unsigned) range of values--they are in fact addresses--so the bit vectors could get pretty large. Even if I scale the values (the addresses will always be at least 4-byte aligned), the range is still potentially 2**29 on a 32-bit CPU, which would make for some pretty sparse bit vectors. RLE might help, but then you lose the advantage of being able to use bitstring logical operations on them.

Later on I might be able to further reduce the range by excluding large chunks at the bottom (and maybe top) of memory, and I might be able to rely on 12- or even 16-byte alignment. That would hugely reduce the size of the bit vectors and could bring this into the realms of feasibility. Thanks.

If anyone knows how to obtain the lower and upper bounds of the dynamic memory of the current Perl process, I'd love to hear from them.
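For a rough feel for the sizes involved, here is a back-of-envelope sketch; the 2GB user-mode address space is an assumption made only to illustrate the arithmetic (it is one way to arrive at the 2**29 figure above), and the alignments are the ones mentioned:

    use strict;
    use warnings;

    # Back-of-envelope sizing: one bit per possible (aligned) address.
    # The 2GB user-mode address space is an assumption for illustration.
    my $address_space = 2**31;

    for my $alignment ( 4, 16 ) {
        my $bits  = $address_space / $alignment;
        my $bytes = $bits / 8;
        printf "%2d-byte alignment: %d bits => %d MB bit vector\n",
            $alignment, $bits, $bytes / 2**20;
    }

    # Setting a bit for an address then just scales the address down first, e.g.
    #   vec( $vec, $address >> 4, 1 ) = 1;   # assuming 16-byte alignment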
Re^3: Searching parallel arrays.
by NiJo (Friar) on Dec 09, 2006 at 19:35 UTC
http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
You need to loop through the arrays once to create the vector and a second time for the test [N-2..N+2].
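A sketch of that two-pass approach, with a plain bit vector standing in for the Bloom filter from the linked article (a real Bloom filter would hash each value several times rather than indexing it directly); the arrays here are made up:

    use strict;
    use warnings;

    # Hypothetical inputs.
    my @a = ( 100, 205, 317, 542 );
    my @b = ( 99, 203, 400, 543 );

    # Pass 1: create the vector from the first array.
    my $vec = '';
    vec( $vec, $_, 1 ) = 1 for @a;

    # Pass 2: for each value N in the second array, test the window N-2 .. N+2.
    for my $n (@b) {
        my @hits = grep { vec( $vec, $_, 1 ) } $n - 2 .. $n + 2;
        print "$n is within 2 of a value in \@a (@hits)\n" if @hits;
    }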
Re^4: Searching parallel arrays.
by BrowserUk (Patriarch) on Dec 10, 2006 at 02:33 UTC
Yeah. I first encountered and attempted to use Bloom filters back in the late 80's. A great mentor of mine at that time once described Bloom filters as "a partial solution looking for a partial problem to partially solve", and on the basis of my experience of trying to use them on that occasion and two occasions since then, I have no reason to argue with that.

They sound great on paper, and many people advocate them on the basis of the configurable low "false positives" rate and low memory consumption, without ever really understanding the implications. Obviously, for some applications they are an economic and efficient solution, if you get the hashing functions correct--which I found non-trivial, but then I probably do not understand the math. But the bit people miss is that you still need to have a real data lookup mechanism, and the real data, available. Once you get your Bloom filter to operate, it's this secondary mechanism that becomes the time/memory-consuming part of the process, and inevitably this is where you have to concentrate your tuning efforts.

On the few occasions I've considered using BFs, once we had tuned the secondary mechanism, we discovered that it was fast enough, or had a small enough footprint, to stand on its own merits without the complication of finding 3, 4 or more suitable and compatible hashing algorithms for the BF. Maybe my lack of a full understanding of the math made the hashing harder than it should be, and had I used a 'canned' BF solution like one of those on CPAN I might have had different experiences. Or maybe the applications to which I tried to apply them simply were not suitable for this solution.

Either way, I think that BFs are often advocated inappropriately, and many times when they are used, better solutions are possible. Often, BFs are advocated for hashing applications that require large numbers of large keys, and they are written to hash entire key fields. But sometimes a little analysis can isolate smaller subsets of those keys, which means the indexes can be smaller, fit in memory, and avoid the expensive secondary storage lookup. Sometimes a perfect hashing algorithm can be found that achieves the same end.

Maybe they would work for this application, I haven't really thought that through, but thrice bitten, fo..er. fri... er..fourth time shy :)
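To make the trade-off concrete, here is a minimal Bloom-filter membership sketch (the size, keys and hashing scheme are only illustrative, not tuned); note that every filter hit still has to be confirmed against the real data, which is the secondary mechanism discussed above:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5 );

    my $m = 2**20;    # filter size in bits (illustrative)
    my $k = 4;        # number of hash functions

    # Derive $k bit positions from one MD5 digest -- a cheap stand-in for
    # $k independent hash functions.
    sub positions {
        my ($key) = @_;
        my @words = unpack 'N4', md5($key);
        return map { $_ % $m } @words[ 0 .. $k - 1 ];
    }

    # Build the filter from the real keys.
    my %real_data = map { $_ => 1 } qw( alpha beta gamma );
    my $filter    = '';
    vec( $filter, $_, 1 ) = 1 for map { positions($_) } keys %real_data;

    # A probe is "possibly present" only if all $k bits are set; even then it
    # may be a false positive, so the real-data lookup is still required.
    for my $probe (qw( alpha delta )) {
        my $hits = grep { vec( $filter, $_, 1 ) } positions($probe);
        if ( $hits == $k ) {
            print "$probe: filter hit, ",
                $real_data{$probe} ? "confirmed by real data\n" : "false positive\n";
        }
        else {
            print "$probe: definitely not present\n";
        }
    }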