Re: Refining a 'vector space search'.
by dws (Chancellor) on Jun 28, 2003 at 16:56 UTC
And the more unique words you have in all your combined documents, the more places/values you have. In one test, I calculated that there would currently be 22,000+ values in each vector/array (most are zeroes).
Was that count before or after stemming? When I played with vector space searching, stemming reduced the number of "words" significantly. But yes, you do end up with big vectors.
This was after stemming, but it was using the default set of words. Either way, with so many documents (and all of them being written by a different person -- these are auction descriptions) the variety and number of truly unique and non-stopped words is going to be really high.
Can you (or anyone) confirm that my understanding is correct: when you take the cosine, you need the entire vector for each document to compare against your search query's vector? That is, you couldn't perform some function on a document vector ahead of time to reduce its size (while preserving the same representation) and still have it compare properly against query vectors? Since the two are so interrelated, I suspect there is no way to do that: each search is unique, and a shrinking computation performed ahead of time would destroy the ability to gather results against each different set of query words.
Like I say, I was just sort of hoping against hope here. I need a viable solution for searching on my site because I have hundreds of thousands of old (closed) auctions, thousands of existing (open) ones and then hundreds of thousands of posts in the forums. So as you can see, simple SQL queries are really not cutting it any longer - both in speed and accuracy (not to mention flexibility). I'm not even sure that I could properly use a reverse index method on this type of setup. Aaaargh.
If you only record whether each word is present, so the vector holds nothing but 0s and 1s, e.g.:
(0, 1, 0, 0, 0, 1, 0, 0, 1, 0)
then you can encode it as a list containing just the positions of the 1s:
( 1, 5, 8 )
This can be a huge space saver if the vector is really large, and it is still very easy to calculate the cosine between vectors in that representation: just count the number of common elements in the two lists and divide by the square root of the product of their lengths.
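In Perl, that index-list representation and its cosine might look like the following minimal sketch (the document data and the `cosine` name are invented for illustration):

```perl
use strict;
use warnings;

# Each document is stored as a list of the positions that hold a 1.
my @doc_a = (1, 5, 8);
my @doc_b = (1, 3, 8, 9);

# Cosine of two 0/1 vectors in index-list form: count the common
# positions, then divide by sqrt of the product of the list lengths.
sub cosine {
    my ($a, $b) = @_;
    my %in_a   = map { $_ => 1 } @$a;
    my $common = grep { $in_a{$_} } @$b;
    return $common / sqrt(@$a * @$b);
}

printf "%.3f\n", cosine(\@doc_a, \@doc_b);    # 2 common / sqrt(3*4)
```

Here documents A and B share positions 1 and 8, so the cosine is 2/sqrt(12).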
Hope this helps, -gjb-
Re: Refining a 'vector space search'.
by TomDLux (Vicar) on Jun 29, 2003 at 06:34 UTC
It's been a few years since I've done math, but checking definitions at wolfram.com convinces me the following is correct. Just don't use the information in any life-or-death situation.
The reason you have floating-point values above is that the vector has been normalized. But if you only care about existence, not frequency, both storage and processing speed can be drastically improved.
Re: Refining a 'vector space search'.
by Anonymous Monk on Jun 29, 2003 at 07:19 UTC
There is an idea that bounces around the data-mining community which goes like this: the most important words are repeated only a few times in a document. Take an article that talks about something specific, like unit testing in Perl, and see how many times the words "unit", "test" and "perl" appear in the text as opposed to the words "code", "module", "method", "error", etc. So this sorta-kinda-maybe justifies the following simplification: don't count the number of times a word appears in a document, just consider whether it's there or not. Now we can use a bit vector as opposed to a "real" vector, and sparse bit vectors have a wonderfully small representation: a list of all the indexes at 1. While I imagine your set of known words will be quite big, odds are the set of words in each document will be small, so this representation will save you a lot of space.
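In Perl such a bit vector can be built with vec(), and the inner product of two bit vectors is then just the population count of their bitwise AND. A sketch, assuming the word-to-bit-position mapping has been built elsewhere (the words and positions here are invented):

```perl
use strict;
use warnings;

# Assumed mapping from each known word to a fixed bit position.
my %index_of = (perl => 0, unit => 1, test => 2, code => 3, module => 4);

# Build a document's bit vector: set one bit per known word present.
sub doc_bits {
    my @words = @_;
    my $bits  = '';
    vec($bits, $index_of{$_}, 1) = 1
        for grep { exists $index_of{$_} } @words;
    return $bits;
}

my $a = doc_bits(qw(perl unit test));
my $b = doc_bits(qw(perl test module));

# Population count of the AND = number of words the documents share.
my $shared = unpack '%32b*', ($a & $b);
print "$shared\n";    # "perl" and "test" are shared
</code>
```

The `%32b*` unpack template sums the set bits, so the whole inner product is one string AND plus one unpack.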
Anyway, you decide what the best in-memory representation is for your data, and then I suggest you use that. This is a very specific application, and one which, IMHO, doesn't map well onto the entity-relation view RDBMSs have of the world. If your application can handle looong start-up times (most can) you could just use Data::Dumper to occasionally "snapshot" the data, or maybe a prevalence-style solution would be a good idea.
The question is: what do you want to know about that vector? You're not really interested in the vector's position in some n-dimensional space; all you care about is its distance from other vectors in that space. Well, how do you calculate that distance? Odds are you'll do what the article says and take the cosine distance. So what info do we need about a document in order to take the cosine distance? We need the magnitude of both vectors and we need the inner product. The magnitude we can pre-calculate (the square root of the sum of the squared elements; we can skip the squaring since all our elements are either one or zero, and while we can't avoid the square root, we only have to take it when we insert a new document, so that's not such a big deal). How does the inner product work? It's sum_{i=1}^{n} a_i * b_i, which in our case (we're using sparse vectors here) is just the size of the intersection of the index lists of a and b. And since those values were inserted in order (got that: keep the list of indexes ordered), this is a quick scan through the list of indexes in a or b (use whichever is shorter).
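That scan over two sorted index lists, with the magnitudes precomputed at insert time, could be sketched like this (the document data and sub name are invented):

```perl
use strict;
use warnings;

# Sorted index lists for two documents; magnitudes are precomputed
# once at insert time (sqrt of the number of set bits).
my @a = (1, 3, 8, 9);
my @b = (1, 5, 8);
my ($mag_a, $mag_b) = (sqrt(scalar @a), sqrt(scalar @b));

# One merge-style pass over both sorted lists counts common indexes.
sub intersection_size {
    my ($x, $y) = @_;
    my ($i, $j, $n) = (0, 0, 0);
    while ($i < @$x && $j < @$y) {
        if    ($x->[$i] < $y->[$j]) { $i++ }
        elsif ($x->[$i] > $y->[$j]) { $j++ }
        else                        { $n++; $i++; $j++ }
    }
    return $n;
}

my $cos = intersection_size(\@a, \@b) / ($mag_a * $mag_b);
printf "%.3f\n", $cos;    # 2 common indexes / sqrt(4*3)
```

The scan is linear in the length of the shorter list plus however much of the longer one it walks, and no full-width vector is ever materialized.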
Now the last thing we'd like to touch on is the fact that for every query we have to check it against each and every other document. You've got two solutions to this: 1) query caching (I'm not going to go into this any further) and 2) vector space partitioning. Odds are your documents are distributed not uniformly but in various sets of similar documents (clusters). Clustering algorithms are still being researched, so there's no real "answer", but the idea is this: every now and then (vague definition, I know) you go through the db and partition it into sets of documents which aren't orthogonal between themselves (i.e. sets of documents sharing the same words). Then when you get a query you can limit your search to only the set of documents which must have a cosine distance > 0. Sorry if this last bit isn't as clear as I'd like.
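One cheap way to find the "documents which must have a cosine distance > 0" set is a word-to-document index: only documents sharing at least one query word can score above zero. A sketch with invented data:

```perl
use strict;
use warnings;

# Word-to-document index, maintained as documents are inserted.
my %docs_with = (
    perl => [1, 2],
    test => [2, 3],
    tuba => [4],
);

# Candidates for a query = union of the posting lists of its words;
# every other document is orthogonal to the query (cosine of 0).
sub candidates {
    my @query = @_;
    my %seen;
    $seen{$_}++ for map { @{ $docs_with{$_} || [] } } @query;
    return sort { $a <=> $b } keys %seen;
}

print join(',', candidates(qw(perl test))), "\n";    # documents 1, 2, 3
```

Document 4 never gets scored for this query, which is exactly the pruning the partitioning idea is after.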
Hope this helps.
Re: Refining a 'vector space search'.
by chunlou (Curate) on Jun 29, 2003 at 00:58 UTC
As a naive way, why don't you try to collapse the unique words with high correlation into statistical synonyms?
If a group of unique words tends to occur together in the same documents often, you could just choose one as the representative in your vector, with the others stored as synonyms for that representative. (And yes, that would make your search a two-step process.)
As for what value to assign to a representative word, there's no theoretically best way, but the sum of the frequencies of all the correlated words is not the way at all, as it will grossly bias the vector.
An unsophisticated way to do it is to simply take the average frequency. A "better" way would be to use factor analysis or any dimension-reduction technique in statistics to empirically figure out the weights for the different words, and come up with a weighted average that way.
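The unsophisticated (simple-average) version might look like this sketch, where the synonym groups are invented and would really come from an offline co-occurrence analysis:

```perl
use strict;
use warnings;

# Assumed statistical synonym groups, discovered offline: each word
# maps to its group's chosen representative.
my %rep_of = (auto => 'car', automobile => 'car', vehicle => 'car');

# Collapse a word => frequency hash: each synonym group is replaced
# by its representative at the group's simple average frequency.
sub collapse {
    my %freq = @_;
    my (%sum, %count);
    for my $w (keys %freq) {
        my $r = exists $rep_of{$w} ? $rep_of{$w} : $w;
        $sum{$r}   += $freq{$w};
        $count{$r} += 1;
    }
    return map { $_ => $sum{$_} / $count{$_} } keys %sum;
}

my %v = collapse(auto => 2, automobile => 4, perl => 1);
# car => 3 (average of 2 and 4), perl => 1
```

Swapping in factor-analysis weights would only change the averaging step; the two-step lookup stays the same.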
Re: Refining a 'vector space search'.
by choocroot (Friar) on Jun 29, 2003 at 02:42 UTC
If you cannot use fixed-length vectors because the full set of documents is too big or unknown (because it's a stream of documents) and can evolve, then you could work with hashes. For each document you store the terms with their frequencies; in effect you keep only the terms with non-zero frequencies. In a database, that could be represented with a "huge" table with document id, term, and frequency columns:
docid | term | freq.
Then, for the search, you retrieve the document hash from the db and expand it to a vector over all the terms (build the "full vector" with 'SELECT DISTINCT term FROM index_table' with everything set to zero, then place your document's term/freq values in it), so you can compute your cosine.
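In fact the expansion step can be skipped entirely: the cosine can be computed straight from two term => frequency hashes, since any term missing from either hash contributes zero to the dot product. A sketch with invented data:

```perl
use strict;
use warnings;

# Cosine similarity of two sparse term => frequency hashes; terms
# absent from a hash are implicitly zero, so only shared terms
# contribute to the dot product.
sub cosine {
    my ($a, $b) = @_;
    my ($dot, $na, $nb) = (0, 0, 0);
    $na += $_**2 for values %$a;
    $nb += $_**2 for values %$b;
    for my $t (keys %$a) {
        $dot += $a->{$t} * $b->{$t} if exists $b->{$t};
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

my %doc   = (perl => 3, search => 1, vector => 2);
my %query = (perl => 1, vector => 1);
printf "%.3f\n", cosine(\%doc, \%query);    # 5 / sqrt(14*2)
```

This keeps the work proportional to the document's own term count rather than the 22,000+ terms of the full vocabulary.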
Stemming modules
by Bluepixel (Beadle) on Jun 29, 2003 at 19:41 UTC
Is there a module on CPAN that detects the language a text is written in? I can't find one.
I have multiple texts which are in German, French, English, Portuguese, etc., and I need to know which stemming language module to apply to each document.
Turning this module into a persistent web app
by davebaker (Pilgrim) on Jun 03, 2004 at 21:39 UTC
I am so impressed with this nifty module. It is awesome for finding "similar" documents, the way the old Excite for Web Servers search engine did (the "More Like This One" button).
The unusual aspect of this search technique is that searches become more accurate the larger the query is ... you can input the entire text of a document and the search engine returns a list of documents like it.
I made a modification to it so that I'd see a document ID in the command-line list of results (in addition to the filename), so that I can input the document ID in order to in effect provide all the terms in that document as the new query ... the result is awesomely accurate.
I'd love to have a web interface for this module and give it a try on a real site. I guess the first big obstacle is to turn the module into a daemon so that once all the vectors are created they could "hang around" without having to be recreated each time the search engine is used. Has anybody done any work in that regard?
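One low-tech way to make the vectors "hang around" is to freeze them to disk with Storable once after indexing, so a long-running daemon (or even each CGI hit) pays only for one retrieve() instead of re-parsing every document. A sketch; the snapshot path and the data are made up:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Storable qw(store retrieve);

# Invented snapshot location for the frozen vectors.
my $dir      = tempdir(CLEANUP => 1);
my $snapshot = "$dir/vectors.storable";

# Indexer side: build the document vectors once, then freeze them.
my %vectors = (
    doc1 => { perl   => 3, vector => 2 },
    doc2 => { search => 1, engine => 4 },
);
store \%vectors, $snapshot;

# Daemon side: thaw the whole structure in one call at start-up and
# keep it in memory across searches.
my $loaded = retrieve $snapshot;
print scalar(keys %$loaded), " documents loaded\n";
```

A proper daemon would wrap this load in its start-up code and re-run store() whenever documents are added; the search loop itself never touches the disk.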