Yeah. I first encountered and attempted to use Bloom filters back in the late '80s.
A great mentor of mine at the time described Bloom filters as "a partial solution looking for a partial problem to partially solve."
And on the basis of my experience trying to use them on that occasion, and on two occasions since, I have no reason to argue with that.
They sound great on paper, and many people advocate them on the basis of the configurable, low false-positive rate and the low memory consumption, without ever really understanding the implications.
The bits people miss are:

* You need to do a secondary, guaranteed (and therefore usually slow) lookup whenever the Bloom filter returns a positive. (See the sketch just below this list.)
* If your application has a preponderance of positive lookups, the Bloom filter is pure overhead most of the time.
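To make that first point concrete, here is a minimal sketch of the lookup flow. The $bloom and $store objects and their may_contain()/fetch() methods are placeholders I've invented for illustration, not any particular CPAN module's API:

    use strict;
    use warnings;

    sub lookup {
        my ( $bloom, $store, $key ) = @_;

        # A negative from the filter is definitive: skip the slow path.
        return undef unless $bloom->may_contain( $key );

        # A positive is only "maybe": the guaranteed, slow lookup still
        # has to run. If most queries are for keys that really exist,
        # this branch dominates and the filter buys you nothing.
        return $store->fetch( $key );
    }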
Obviously, for some applications they are an economical and efficient solution, provided you get the hashing functions right--which I found non-trivial, though perhaps I just don't understand the math--but you still need a real data-lookup mechanism, and the real data, available.
Once you get your Bloom filter operating, it's this secondary mechanism that becomes the time- and memory-consuming part of the process, and inevitably it is where you have to concentrate your tuning efforts. On the few occasions I've considered using BFs, once we had tuned the secondary mechanism we discovered it was fast enough, or had a small enough footprint, to stand on its own merits without the complication of finding 3, 4 or more suitable and compatible hashing algorithms for the BF.
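For what it's worth, one well-known trick that sidesteps hunting for several compatible hash algorithms is to derive all k bit positions from a single digest, via h_i(key) = (h1 + i*h2) mod m. A rough sketch using the core Digest::MD5 module; the filter size and k below are arbitrary choices for the example:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5 );

    my $M = 2**20;                    # filter size in bits (arbitrary)
    my $K = 4;                        # bit positions set per key (arbitrary)
    my $filter = "\0" x ( $M / 8 );

    # All K positions come from one digest: no need to find K
    # independent, mutually compatible hashing algorithms.
    sub positions {
        my ( $key ) = @_;
        my ( $h1, $h2 ) = unpack 'NN', md5( $key );
        return map { ( $h1 + $_ * $h2 ) % $M } 0 .. $K - 1;
    }

    sub bf_add {
        my ( $key ) = @_;
        vec( $filter, $_, 1 ) = 1 for positions( $key );
    }

    sub bf_check {
        my ( $key ) = @_;
        for my $pos ( positions( $key ) ) {
            return 0 unless vec( $filter, $pos, 1 );    # definite miss
        }
        return 1;    # "maybe there": the real lookup still has to run
    }

Whether that would have changed my conclusions I can't say, but it does remove one of the complications I kept tripping over.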
Maybe my lack of a full understanding of the math made the hashing harder than it should have been, and had I used a 'canned' BF solution, like one of those on CPAN, I might have had a different experience. Or maybe the applications to which I tried to apply them simply were not suited to this solution.
Either way, I think that BFs are often advocated inappropriately, and many times when they are used, better solutions are possible. Often BFs are advocated for applications that require hashing large numbers of large keys, and they are written to hash the entire key field. But sometimes a little analysis can isolate smaller subsets of those keys, which means the indexes can be smaller, fit in memory, and avoid the expensive secondary-storage lookup. Sometimes a perfect hashing algorithm can be found that achieves the same end.
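As a hedged illustration of that last point, suppose analysis shows the first few bytes of each key are nearly discriminating. The prefix length and the flat-file-of-offsets layout below are invented for the example:

    use strict;
    use warnings;

    my $PREFIX_LEN = 8;    # assumed: 8-byte prefixes are nearly unique
    my %offsets_for;       # key prefix => offsets of candidate records

    sub index_record {
        my ( $key, $offset ) = @_;
        push @{ $offsets_for{ substr $key, 0, $PREFIX_LEN } }, $offset;
    }

    # Usually yields zero or one candidate; each costs one seek to verify
    # against the full key. No Bloom filter, and the index fits in memory.
    sub candidates {
        my ( $key ) = @_;
        return @{ $offsets_for{ substr $key, 0, $PREFIX_LEN } || [] };
    }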
Maybe they would work for this application--I haven't really thought it through--but thrice bitten, fo..er, fri...er, fourth time shy :)