in reply to Re^2: Searching parallel arrays.
in thread Searching parallel arrays.

I understand that scaling issue. In this situation it makes sense to use a Bloom::Filter, essentially a compact bit vector populated by several hash functions. The approach scales well but can produce false positives:

http://www.perl.com/pub/a/2004/04/08/bloom_filters.html

You need to loop through the arrays once to populate the filter and a second time to run the [N-2..N+2] test.
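
For what it's worth, here is a minimal sketch of that two-pass idea, assuming the CPAN Bloom::Filter interface (new/add/check) and made-up numeric data standing in for the real arrays:

    use strict;
    use warnings;
    use Bloom::Filter;

    # Stand-in data; the real code would use the OP's parallel arrays.
    my @first  = map { int rand 1_000_000 } 1 .. 50_000;
    my @second = map { int rand 1_000_000 } 1 .. 50_000;

    # Pass 1: load one array into the filter.
    my $bf = Bloom::Filter->new( capacity => scalar @first, error_rate => 0.001 );
    $bf->add( @first );

    # Pass 2: for each element of the other array, probe the N-2 .. N+2 window.
    for my $n ( @second ) {
        for my $probe ( $n - 2 .. $n + 2 ) {
            next unless $bf->check( $probe );   # cheap test, but may be a false positive
            print "possible match: $n ~ $probe\n";
        }
    }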

Re^4: Searching parallel arrays.
by BrowserUk (Patriarch) on Dec 10, 2006 at 02:33 UTC

    Yeah. I first encountered and attempted to use bloom filters back in the late '80s.

    A great mentor of mine at that time once described bloom filters as

    A partial solution looking for a partial problem to partially solve.

    And on the basis of my experience of trying to use them on that occasion and two occasions since then, I have no reason to argue with that.

    They sound great on paper, and many people advocate them on the basis of the configurably low false-positive rate and low memory consumption, without ever really understanding the implications.

    The bits people miss are:

    1. They only pay off if your application involves a high ratio of negative to positive lookups.

      You need to do a secondary, guaranteed, and therefore usually slow, lookup whenever the bloom filter returns a positive (the sketch after this list shows that check-then-confirm pattern).

      If your application has a preponderance of positive lookups, the use of the bloom filter is overhead most of the time.

    2. If your hits are mostly positive, then you still have to do the slower or memory-hungry lookup to confirm them, so the memory devoted to the bloom filter becomes overhead as well.
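
    A hedged sketch of that check-then-confirm pattern, with a plain hash standing in for whatever slow or memory-hungry store holds the real data:

        use strict;
        use warnings;
        use Bloom::Filter;

        # %real stands in for the authoritative (slow or memory-hungry) store.
        my %real = map { $_ => 1 } 1 .. 10_000;

        my $bf = Bloom::Filter->new( capacity => scalar( keys %real ), error_rate => 0.01 );
        $bf->add( keys %real );

        sub lookup {
            my ( $key ) = @_;
            return 0 unless $bf->check( $key );   # only negatives get answered cheaply
            return exists $real{$key};            # every positive still hits the real store
        }

    If most probes come back positive, nearly every call pays both costs and the filter is pure overhead; it only earns its keep when the cheap 'no' on the first line of lookup() is the common case.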

    Obviously, for some applications they are an economical and efficient solution, provided you get the hashing functions right--which I found non-trivial, but then I probably do not understand the math--but you still need a real data-lookup mechanism, and the real data, available.

    Once you get your bloom filter to operate, it's this secondary mechanism that becomes the time- and memory-consuming part of the process, and inevitably this is where you have to concentrate your tuning efforts. On the few occasions I've considered using BFs, once we had tuned the secondary mechanism, we discovered that it was fast enough, or had a small enough footprint, to stand on its own merits without the complication of finding 3, 4 or more suitable and compatible hashing algorithms for the BF.

    Maybe my lack of a full understanding of the math made the hashing harder than it should have been, and had I used a 'canned' BF solution, such as one of those on CPAN, I might have had a different experience. Or maybe the applications to which I tried to apply them simply were not suitable for this solution.

    Either way, I think BFs are often advocated inappropriately, and many times when they are used, better solutions are possible. Often, BFs are advocated for hashing applications that involve large numbers of large keys, and they are written to hash the entire key field. But sometimes a little analysis can isolate smaller subsets of those keys, which means the indexes can be smaller, fit in memory, and avoid the expensive secondary-storage lookup. Sometimes a perfect hashing algorithm can be found that achieves the same end.
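
    To illustrate that last point, a sketch with a completely made-up record layout, assuming analysis showed a short prefix of the key is selective enough to pre-filter on, so a plain in-memory hash suffices:

        use strict;
        use warnings;

        # Made-up records with long keys; assume the first 8 characters
        # are selective enough to narrow a lookup to a few candidates.
        my @records = (
            [ 'CUST0001-2006-12-10-ORDER-000042', 'data A' ],
            [ 'CUST0002-2006-12-09-ORDER-000017', 'data B' ],
        );

        # Index only the 8-character prefix: a much smaller in-memory structure.
        my %index;
        push @{ $index{ substr( $_->[0], 0, 8 ) } }, $_ for @records;

        # Narrow to candidates via the prefix, then confirm the full key.
        sub find {
            my ( $full_key ) = @_;
            my @candidates = @{ $index{ substr( $full_key, 0, 8 ) } || [] };
            return grep { $_->[0] eq $full_key } @candidates;
        }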

    Maybe they would work for this application, I haven't really thought that through, but thrice bitten, fo..er. fri... er..fourth time shy :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.