in reply to Fingerprinting text documents for approximate comparison
My guess would be to extract all the words from the web page and build a bit vector, setting one bit per word at a position chosen by some hash function. Choose the size of the bit vector so that the fraction of bits set stays well below 1; a nearly full vector tells documents apart poorly. If that makes the bit vector too large, store only a fixed-size slice of it (that is, include only those words whose hash value falls in a certain interval). You can then count the bits that are set in exactly one of the two bit vectors to get an approximate distance between the texts.
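Here is a rough sketch of that idea in Python, just to make it concrete. Everything in it is my own assumption rather than a fixed recipe: the names `fingerprint`, `fill_ratio` and `distance`, the 1024-bit default size, the use of SHA-1 as the hash, and the `\w+` word splitting. The `keep_bits` parameter plays the role of the stored slice: only words whose hash value falls below it are recorded.

```python
import hashlib
import re


def fingerprint(text, num_bits=1024, keep_bits=None):
    """Return a bit vector (stored as an int) with one bit set per distinct word.

    num_bits  -- size of the full hash range (assumed default, tune to taste)
    keep_bits -- if given, keep only words whose hash value is below it,
                 i.e. store just a slice of the full vector
    """
    if keep_bits is None:
        keep_bits = num_bits
    bits = 0
    for word in set(re.findall(r"\w+", text.lower())):
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % num_bits
        if position < keep_bits:
            bits |= 1 << position
    return bits


def fill_ratio(bits, size):
    """Fraction of bits set; this should stay well below 1."""
    return bin(bits).count("1") / size


def distance(a, b):
    """Count the bits that are set in exactly one of the two vectors."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc1 = fingerprint("the quick brown fox jumps over the lazy dog")
    doc2 = fingerprint("the quick brown fox jumps over the lazy cat")
    print(distance(doc1, doc2), fill_ratio(doc1, 1024))
```

Two distinct words can hash to the same bit, so the distance is only approximate, and it gets less reliable as the fill ratio grows. Distances are only comparable between fingerprints built with the same `num_bits` and `keep_bits`.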
See also the material on retrieval on secondary keys in Knuth, volume 3, which will surely teach you much more about this topic.