Assuming you have a retrieval task where you need to know the "nearest neighbours"-set or drill down to find the "one exactly matching"-entry among a vast number of entries. What would be the best scheme to store such data and look that up? I am talking about the "find the needle in the haystack" by "applying filter after filter" problem.
Thing: colour: grey material: wood surface: rough
Now, I want to look that up, but more in a "has this AND this AND this..." manner. Having that said, the product db is probably not such a good example. But it's similar data... Anyway. I want to filter the data iteratively, with the expectation to find one exact match, or, if that fails, getting back a number of close/next neighbours. Properties/ filters may be added in no particular order.
One poor implementation (I think), especially with big data and many properties, is a relational db. I went down that road and ended up with concatenating a dynamic SQL sentence with *many* "AND WHERE .. IS", expeting that this will bog down once I hit a certain number of entries.
Having done some search stuff, my second idea was an inverted index, as I know how well these perform on large datasets. I could aggregate matching entries for each of the properties/features, (treating them as postings), getting back a number of results for each property. The intersection of these sets would be my result set. Search::GIN might be of help here.
But is that how a project like Panopticlick calculates "similarity" across bit-vectors?
Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN, like MMaciej's Search::VectorSpace, to judge if they are any good for production tasks. Actually I've never used vectors at all. But from what I seem to understand about vectors, using this math requires my properties to be sorted, right?
So again, fellow Monks, is that how it is best done, inverted indices? How would you find the needle in the stack?
In reply to What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)? by isync
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |