isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Assume you have a retrieval task where you need to know the "nearest neighbours" set, or to drill down to the one exactly matching entry, among a vast number of entries. What would be the best scheme to store such data and look it up? I am talking about the "find the needle in the haystack by applying filter after filter" problem.


Let me elaborate a bit on the scenario again: we have a vast number of entries, each with a list of properties. These properties form data vectors, whose degree of uniqueness depends on the number of properties and on how frequent those properties are. One example would be data like Panopticlick's ("is your browser fingerprint unique"), or a product database relying on tags.
Thing:
  colour: grey
  material: wood
  surface: rough

Now, I want to look that up, but more in a "has this AND this AND this..." manner. That said, the product db is probably not such a good example, but it's similar data... Anyway. I want to filter the data iteratively, expecting to find one exact match or, failing that, to get back a number of close/nearest neighbours. Properties/filters may be added in no particular order.

One poor implementation (I think), especially with big data and many properties, is a relational db. I went down that road and ended up concatenating a dynamic SQL statement with *many* "AND WHERE .. IS" clauses, expecting that this will bog down once I hit a certain number of entries.

Having done some search stuff, my second idea was an inverted index, as I know how well these perform on large datasets. I could aggregate matching entries for each of the properties/features (treating them as postings), getting back a set of results for each property. The intersection of these sets would be my result set. Search::GIN might be of help here.
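For my own notes, a toy sketch of what I mean by intersecting postings lists, in plain Perl (the postings data is invented):

  use strict;
  use warnings;

  # toy postings: property => ids of the entries carrying it
  my %postings = (
      'colour:grey'   => [ 1, 2, 5 ],
      'material:wood' => [ 2, 3, 5 ],
      'surface:rough' => [ 2, 5, 7 ],
  );

  # intersect the postings lists of all requested filters
  sub lookup {
      my @filters = @_;
      my %seen;
      $seen{$_}++ for map { @{ $postings{$_} || [] } } @filters;
      return grep { $seen{$_} == @filters } sort { $a <=> $b } keys %seen;
  }

  print join( ', ', lookup( 'colour:grey', 'material:wood', 'surface:rough' ) ), "\n";   # 2, 5
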
But is that how a project like Panopticlick calculates "similarity" across bit-vectors?
Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN, like MMaciej's Search::VectorSpace, to judge if they are any good for production tasks. Actually I've never used vectors at all. But from what I seem to understand about vectors, using this math requires my properties to be sorted, right?

So again, fellow Monks, is that how it is best done, with inverted indices? How would you find the needle in the haystack?


Replies are listed 'Best First'.
Re: What is the best way to store and look-up a "similarity vector"?
by BrowserUk (Patriarch) on Nov 14, 2013 at 17:07 UTC
    But is that how a project like Panopticlick calculates "similarity" across bit-vectors?

    I've no idea about pana-pano, that thing, but ostensibly, it can be as simple as counting the number of matching bits (properties):

    my $needle = getBits();                 # bit-vector of the sought properties
    my $nBits  = unpack '%32b*', $needle;   # number of bits set in the query

    for my $straw ( @haystack ) {
        my $similarity = unpack '%32b*', $straw & $needle;
        printf "Percentage similarity %f\n", $similarity / $nBits * 100;
    }

    For parts/product catalogue type applications, bit-mapped data records can be very effective and efficient, because their attributes tend to have a fixed (and limited) number of values: half a dozen colors, half a dozen sizes, 2 or 3 finishes, etc. That means each choice can be represented by a bit position in a vector. Selection can be extremely fast.
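    For instance, a fixed layout might look roughly like this (the attribute values and bit positions are invented for illustration):

    use strict;
    use warnings;

    # invented fixed layout: one bit per attribute/value choice
    my %bitpos = (
        'colour:grey'  => 0, 'colour:red' => 1, 'colour:blue'   => 2,
        'size:small'   => 3, 'size:large' => 4,
        'finish:rough' => 5, 'finish:smooth' => 6,
    );

    sub record {    # build a bit string from a list of attribute:value pairs
        my $bits = '';
        vec( $bits, $bitpos{$_}, 1 ) = 1 for @_;
        return $bits;
    }

    my $thing = record( 'colour:grey', 'size:large', 'finish:rough' );
    my $query = record( 'colour:grey', 'finish:rough' );

    print "all sought attributes present\n" if ( $thing & $query ) eq $query;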


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes, totally agree. For example, where would file-systems be without bit-maps?!

      My problem is that my data is more or less schema-less. I know about a certain set of properties that might show up, but, similar to the "browser fingerprint" problem, I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout. Having to know all varieties in advance is then much like an algorithm that requires the properties to be in a fixed order so they can be aligned with each other.
        I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.

        Use a two-step schema.

        1. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.

          Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).

          The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.

        2. The data records are variable length strings of bits.

          As the number of A/V pairs increases, so does the length of (new) records, but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.

        To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.
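
        A rough sketch of those two steps in Perl (using an in-memory hash as a stand-in for the auto-incrementing table):

        use strict;
        use warnings;

        my %bit_of;        # A/V pair => bit position (the "auto-increment" stand-in)
        my $next_bit = 0;

        sub bit_for {      # step 1: assign a bit the first time an A/V pair is seen
            my ($av) = @_;
            $bit_of{$av} = $next_bit++ unless exists $bit_of{$av};
            return $bit_of{$av};
        }

        sub encode {       # step 2: variable-length bit string per record
            my $bits = '';
            vec( $bits, bit_for($_), 1 ) = 1 for @_;
            return $bits;
        }

        my @records = (
            encode( 'colour:grey', 'material:wood', 'surface:rough' ),
            encode( 'colour:grey', 'material:steel' ),   # new A/V pair; old bits keep their meaning
        );

        my $mask  = encode( 'colour:grey', 'material:wood' );
        my $nBits = unpack '%32b*', $mask;

        for my $i ( 0 .. $#records ) {
            my $hits = unpack '%32b*', $records[$i] & $mask;
            printf "record %d: %.0f%% of the sought bits\n", $i, $hits / $nBits * 100;
        }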


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: What is the best way to store and look-up a "similarity vector"?
by educated_foo (Vicar) on Nov 14, 2013 at 16:31 UTC
    Assume you have a retrieval task where you need to know the "nearest neighbours" set, or to drill down to the one exactly matching entry, among a vast number of entries.
    These are completely different problems. To find exactly matching things, you can just serialize and use hashing. To find "close" things, you need a measure of "closeness," and a much more complicated data structure.
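    For the exact-match case, for example (a toy sketch; the canonical serialisation format and the id are made up):

    use strict;
    use warnings;

    sub key_of {    # canonical serialisation: sort the properties so their order doesn't matter
        my (%props) = @_;
        return join ';', map { "$_=$props{$_}" } sort keys %props;
    }

    my %index;
    $index{ key_of( colour => 'grey', material => 'wood', surface => 'rough' ) } = 'thing_42';

    my $hit = $index{ key_of( surface => 'rough', colour => 'grey', material => 'wood' ) };
    print "$hit\n";    # thing_42, however the properties arrived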
    Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN to judge if they are any good for production tasks.
    vec and the bitwise string operators work pretty well.
      I think I'm applying a human perspective on information retrieval here. I didn't even think about something as exact as hashing.
      Although, I remember some digests (could've been TEA) that actually behave like the data I was referring to: the hashes begin to differ more as the underlying data differs more. Like
      "Foo a" would hash to "a1b2c3d4", and
      "Foo b" would hash to "a1b2c3d5"
      
      - they are similar to a certain degree. A query for "a1b2c3d*" in a system similar to the one proposed would return both. But the example requires the properties to adhere to a certain order.

      Back to the "human perspective": what I have in mind is something like "is this your car?", where you begin to apply filters: right color, make, etc. But how sure can you ever be? "Matching" in life is a process of approximation - the question is: how do I implement that as an algorithm?
        I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.
        Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs.

        Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.
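
        To make those keywords a bit more concrete, here is a toy MinHash sketch (single-token shingles and a handful of salted hash functions; purely illustrative):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5);
        use List::Util qw(min sum);

        # MinHash signature: for each of $k salted hash functions, keep the
        # minimum hash value over all shingles of the set
        sub signature {
            my ( $k, @shingles ) = @_;
            return [ map {
                my $salt = $_;
                min map { unpack 'N', md5("$salt:$_") } @shingles;
            } 1 .. $k ];
        }

        # fraction of matching signature positions ~ estimated Jaccard similarity
        sub similarity {
            my ( $x, $y ) = @_;
            return sum( map { $x->[$_] == $y->[$_] ? 1 : 0 } 0 .. $#$x ) / @$x;
        }

        my $sig_a = signature( 64, qw( colour:grey material:wood surface:rough ) );
        my $sig_b = signature( 64, qw( colour:grey material:wood surface:smooth ) );

        printf "estimated Jaccard similarity: %.2f\n", similarity( $sig_a, $sig_b );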

        Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.

Re: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 19:03 UTC

    The previous sub-thread suggests that building bitmaps as record "fingerprints" (binary patterns, based on a two-step/abstracting schema) and, on look-up, comparing them against a query "fingerprint" might be an efficient solution, as it relies on binary operators and binary (core) functions. A low-level one, though. (Implementing this might be reinventing the wheel... search CPAN first.)

    Stemming from that and looking at the tools available, from what I know, inverted-index engines like Lucy, Plucene or Sphinx do exactly that. No? They look up postings/terms, assign an id, and store these ids with the help of bitmaps. On lookup, they do the aforementioned comparisons and apply boolean operations (intersection).

    So, would it be a sane setup to use a search-engine backend as an off-the-shelf bit-vector db? The only problem I see is that applying too many filters would result in an empty set. I don't know if any of the mentioned backends allows for something like a "cancel this property/filter and you'll get this many more results" sort of similarity option...
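
    Otherwise, what I'd probably have to hand-roll is ranking records by how many of the sought bits they carry, so an over-constrained query degrades into a "closest matches" list instead of an empty set. A toy sketch (invented bit-string records, laid out as in the sub-thread above):

    use strict;
    use warnings;

    # toy bit-string records (bit positions as handed out by a two-step schema)
    my @records = ( "\x07", "\x03", "\x05" );   # bits 0-2 / bits 0-1 / bits 0 and 2
    my $mask    = "\x07";                       # query with all three filters set
    my $wanted  = unpack '%32b*', $mask;

    # rank by number of matching bits rather than demanding all of them
    my @ranked = sort { $b->{hits} <=> $a->{hits} }
                 map  { +{ id => $_, hits => unpack( '%32b*', $records[$_] & $mask ) } }
                 0 .. $#records;

    printf "record %d matches %d of %d filters\n", $_->{id}, $_->{hits}, $wanted for @ranked;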