Yeah. I first encountered and attempted to use bloom filters back in the late 80's.

A great mentor of mine at that time once described bloom filters as

A partial solution looking for a partial problem to partially solve.

And on the basis of my experience of trying to use them on that occasion and two occasions since then, I have no reason to argue with that.

They sound great on paper, and many people advocate them on the basis of the configurable low "false positives" rate and low memory consumption, without ever really understanding the implications.

The bits people miss are

  1. they only work for efficiency if your application involves a large proportion of negative to positive lookups.

    You need to do a secondary, guaranteed and therefore usually slow, lookup if the bloom filter returns positive.

    If your application has a preponderance of positive lookups, the use of the bloom filter is overhead most of the time.

  2. If your hits are mostly positive, then you still have to do the slower or memory hungry lookup to confirm your hits, so the memory devoted to the bloom filter becomes overhead also.

Obviously, for some applications they are an economic and efficient solution, if you get the hashing functions correct--which I found non-trivial, but then I probably do not understand the math--but you still need to have a real data lookup mechanism, and the real data, available.

Once you get your bloom filter to operate, it's this secondary mechanism that becomes the time/memory consuming part of the process, and inevitably, this is where you have to concentrate your tuning efforts. On the few occasions I've considered using BFs, once we had tuned the secondary mechanism, we discovered that it was fast enough/or had a small enough footprint to stand on it's own merits without the complication of finding 3, 4 or more suitable and compatible hashing algorithms for the BF.

Maybe my lack of a full understanding of the math made the hashing harder than it should be and had I used a 'canned' BF solution like one of those on CPAN I might have had different experiences. Or maybe, the applications to which I tried to apply them simply were not suitable for this solution.

Either way, I think that BFs are often advocated inappropriately, and many times when they are used, better solutions are possible. Often, BFs are advocated for hashing applications that require large numbers of large keys, and they are written to hash entire key fields. But sometimes a little analysis can isolate smaller subsets of those keys that mean indexes can be smaller and so fit in memory and avoid the expensive secondary storage lookup. Sometimes, a perfect hashing algorithm can be found that achieves the same end.

Maybe they would work for this application, I haven't really thought that through, but thrice bitten, fo..er. fri... er..forth time shy :)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^4: Searching parallel arrays. by BrowserUk
in thread Searching parallel arrays. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.