Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.
To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.
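As a rough sketch of that normalize-and-checksum idea (the 4-character cutoff and Digest::MD5 are just placeholders I picked for illustration):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Reduce a document to its "significant" words, then checksum the result.
    # The 4-character cutoff and the choice of MD5 are placeholders.
    sub fuzzy_checksum {
        my ($text) = @_;
        my @words = grep { length($_) >= 4 }      # drop short words
                    map  { lc }                   # fold case
                    split /[^A-Za-z]+/, $text;    # strip punctuation and whitespace
        return md5_hex(join '', @words);
    }

    # These two print the same checksum, since they differ only in
    # short words, case, and punctuation.
    print fuzzy_checksum("The rocks of the Moon, as seen by Galileo."), "\n";
    print fuzzy_checksum("rocks moon... seen GALILEO!"), "\n";

Two documents that differ only in short words, case, and punctuation then hash to the same value, so I'm back to comparing checksums instead of comparing documents pairwise.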
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
...Nexcerpt...Connecting People With Expertise
Understood...
What about compressing the document with something like Huffman encoding? That would shorten all of the words, replacing repeated instances with keys, so the text really compresses. You could then compare the 'keys' it uses as replacements to spot like text. Going further (and this might be pushing it), you could store just the keys it uses (i.e. the header from the compression); since those are assigned based on frequency of use, you could also eliminate all of the short words at that point.
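A very rough sketch of that in Perl (not real Huffman coding, just a frequency count of the long words standing in for the compressor's key table; the 4-character cutoff and the 10-key limit are arbitrary):

    use strict;
    use warnings;

    # Stand-in for the "compression header" idea: instead of running a real
    # compressor, count how often each long word occurs and keep the most
    # frequent ones as the document's keys.
    sub top_keys {
        my ($text, $n) = @_;
        $n ||= 10;
        my %freq;
        $freq{lc $_}++ for grep { length($_) >= 4 } split /[^A-Za-z]+/, $text;
        my @keys = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
        $#keys = $n - 1 if @keys > $n;     # keep only the top $n
        return @keys;
    }

    print join(' ', top_keys("Lunar rocks and lunar geology: rocks of the Moon")), "\n";
    print join(' ', top_keys("The geology of Moon rocks, and other lunar rocks")), "\n";

Documents about the same subject should then share most of their keys, even if the rest of the wording differs.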
Just a thought.. =)
Regards Paul
Update:
You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around that with this method is to take the header that is generated, pick the top 10 most-used chunks (they can be words or phrases), and sort them alphabetically. Then store your document in directories and subdirectories based on each word, e.g.
\stars\moon\rocks\
has
RocksoftheMoon.doc and GanymedeGeology.doc
for instance....
This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory... or at least nearby.
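A rough sketch of that filing scheme, reusing the same frequency-count idea (the key count of 3, the base directory, and the helper names are made up for illustration):

    use strict;
    use warnings;
    use File::Path qw(make_path);
    use File::Copy qw(copy);
    use File::Basename qw(basename);

    # Most frequent long words in the text, sorted alphabetically.
    # The 4-character cutoff and the key count are arbitrary, as before.
    sub sorted_keys {
        my ($text, $n) = @_;
        my %freq;
        $freq{lc $_}++ for grep { length($_) >= 4 } split /[^A-Za-z]+/, $text;
        my @keys = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
        $#keys = $n - 1 if @keys > $n;
        return sort @keys;
    }

    # File the document under a path built from its keys, so that like
    # documents land in the same (or a nearby) directory. The base
    # directory and the depth of 3 are made up.
    sub file_document {
        my ($doc_path, $text, $base) = @_;
        my $dir = join '/', $base, sorted_keys($text, 3);
        make_path($dir);
        copy($doc_path, "$dir/" . basename($doc_path))
            or die "copy failed: $!";
        return $dir;
    }

    # file_document('RocksoftheMoon.doc', $text, '/archive') might then
    # return something like '/archive/moon/rocks/stars', and
    # GanymedeGeology.doc could end up nearby.

Whether near-duplicates actually land close together depends on them sharing the same top keys, but it does turn the search into a directory lookup instead of a pairwise comparison.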
Regards Paul