in reply to Fingerprinting text documents for approximate comparison

Hi,
How about doing a diff on the two files and then dividing the size of the diff by the size of the file to give you a percentage of similarity? You might want to prune the diff output so it's only giving the results from the comparison file; otherwise the diff will be a little over twice the expected size, since both sides are listed plus some fluff from diff.
This is more relevant than just comparing the sizes of the files, which is obviously not a solution, in that this way does give consideration to content.
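Something like this rough sketch, assuming diff(1) is on the PATH and both files are plain text; only the '>' lines (content unique to the second file) are counted, per the pruning above:

  #!/usr/bin/perl
  # Rough sketch of the diff-ratio idea. Assumes diff(1) is available and
  # both inputs are plain text; only the '>' lines (content unique to the
  # second file) are counted, so we don't count both sides plus markers.
  use strict;
  use warnings;

  die "usage: $0 oldfile newfile\n" unless @ARGV == 2;
  my ($old, $new) = @ARGV;

  my $diff_size = 0;
  open my $dh, '-|', 'diff', $old, $new or die "diff: $!";
  while (<$dh>) {
      $diff_size += length($1) if /^> (.*)/;   # material only in $new
  }
  close $dh;   # diff exits 1 when the files differ; that's fine here

  my $similarity = 100 * (1 - $diff_size / (-s $new));
  $similarity = 0 if $similarity < 0;          # very different files
  printf "%s vs %s: roughly %.0f%% similar\n", $old, $new, $similarity;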

Regards Paul

Re^2: Fingerprinting text documents for approximate comparison
by Mur (Pilgrim) on Mar 24, 2005 at 21:34 UTC
    Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.

    To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.
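    Something like this, for instance (just a sketch; the four-letter cutoff and MD5 are arbitrary choices):

      #!/usr/bin/perl
      # Sketch of the "fuzzy checksum": throw out short words, whitespace
      # and punctuation, then checksum whatever is left. The four-letter
      # cutoff and the use of MD5 are arbitrary choices here.
      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      sub fuzzy_checksum {
          my ($text) = @_;
          my @words = grep { length($_) >= 4 }     # drop the short words
                      map  { lc }
                      $text =~ /([A-Za-z]+)/g;     # letters only, so no punctuation
          return md5_hex(join '', @words);         # and no whitespace
      }

      my $doc = do { local $/; <> };               # slurp the document
      print fuzzy_checksum($doc), "\n";

    Two documents that differ only in spacing, punctuation, or short filler words then hash to the same value, and the checksums can go straight into a hash or database index, so no pairwise comparison is needed.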

    --
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
     http://www.nexcerpt.com
    ...Nexcerpt...Connecting People With Expertise
      understood...
      What about compressing the document with something like Huffman encoding? That would shorten all of the words, replacing repeated instances with keys, so it would really compress the text. You could then compare the 'keys' it uses as replacements to compare like text. Going further (though this might be pushing it), you could just store the keys it uses (i.e. the header from the compression); since those are assigned based on frequency of use, that would also let you eliminate all of the short words.
      Just a thought.. =)
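      Roughly what I'm picturing, leaving out the actual compression and just ranking words by frequency (which is what the header would reflect anyway); the ten-key and four-letter cutoffs are made up:

        #!/usr/bin/perl
        # Sketch of the "compare the keys" idea: skip real Huffman coding
        # and just rank the long words by how often they appear, keeping
        # the top few as the document's key list. Cutoffs are arbitrary.
        use strict;
        use warnings;

        sub key_list {
            my ($text) = @_;
            my %freq;
            $freq{lc $_}++ for grep { length($_) >= 4 } $text =~ /([A-Za-z]+)/g;
            my @keys = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
            my $n = @keys < 10 ? scalar @keys : 10;
            return @keys[0 .. $n - 1];
        }

        # Two documents can then be compared by how many keys they share.
        sub shared_keys {
            my ($list_a, $list_b) = @_;
            my %seen = map { $_ => 1 } @$list_a;
            return scalar grep { $seen{$_} } @$list_b;
        }

      (With ten keys per document, a shared count of, say, eight or more could be treated as a near-duplicate; that threshold is a guess.)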

      Regards Paul

      Update:
      You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around that with this method is to take the header that is generated, pick the top 10 most-used chunks (they can be words or phrases), and sort them alphabetically. Then store your document in directories and subdirectories based off each word, e.g.

      \stars\moon\rocks\
      has RocksoftheMoon.doc and GanymedeGeology.doc
      for instance....
      This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory... or at least nearby.
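      In code the filing step might look something like this (again word frequency stands in for the compression header, and the three-word depth and four-letter cutoff are made up):

        #!/usr/bin/perl
        # Sketch of the filing idea: take the most-used long words, sort
        # them alphabetically, and use them as a directory path so that
        # similar documents end up in (or near) the same place.
        use strict;
        use warnings;
        use File::Path qw(make_path);
        use File::Copy qw(copy);
        use File::Basename qw(basename);

        my ($doc) = @ARGV or die "usage: $0 document\n";
        my $text = do { local (@ARGV, $/) = ($doc); <> };

        my %freq;
        $freq{lc $_}++ for grep { length($_) >= 4 } $text =~ /([A-Za-z]+)/g;
        my @top = (sort { $freq{$b} <=> $freq{$a} } keys %freq)[0 .. 2];
        my $dir = join '/', 'filed', sort grep { defined } @top;

        make_path($dir);
        copy($doc, $dir . '/' . basename($doc)) or die "copy: $!";
        print "filed $doc under $dir/\n";

      (Sorted alphabetically the example above would come out as moon/rocks/stars rather than stars/moon/rocks, but the idea is the same.)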

      Regards Paul