Hi Monks,

Assuming you have a retrieval task where you need to know the "nearest neighbours"-set or drill down to find the "one exactly matching"-entry among a vast number of entries. What would be the best scheme to store such data and look that up? I am talking about the "find the needle in the haystack" by "applying filter after filter" problem.


Let me elaborate a bit on the scenario again: Given, we have a vast number of entries, each having a list of properties. These properties form data vectors, a dregree of uniqueness related to their number of properties and the frequency of these properties. One example would be data like Panopticlick's ("is your browser fingerprint unique"), or a product database relying on tags.
Thing:
  colour: grey
  material: wood
  surface: rough

Now, I want to look that up, but more in a "has this AND this AND this..." manner. Having that said, the product db is probably not such a good example. But it's similar data... Anyway. I want to filter the data iteratively, with the expectation to find one exact match, or, if that fails, getting back a number of close/next neighbours. Properties/ filters may be added in no particular order.

One poor implementation (I think), especially with big data and many properties, is a relational db. I went down that road and ended up with concatenating a dynamic SQL sentence with *many* "AND WHERE .. IS", expeting that this will bog down once I hit a certain number of entries.

Having done some search stuff, my second idea was an inverted index, as I know how well these perform on large datasets. I could aggregate matching entries for each of the properties/features, (treating them as postings), getting back a number of results for each property. The intersection of these sets would be my result set. Search::GIN might be of help here.
But is that how a project like Panopticlick calculates "similarity" across bit-vectors?
Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN, like MMaciej's Search::VectorSpace, to judge if they are any good for production tasks. Actually I've never used vectors at all. But from what I seem to understand about vectors, using this math requires my properties to be sorted, right?

So again, fellow Monks, is that how it is best done, inverted indices? How would you find the needle in the stack?


In reply to What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)? by isync

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.