in reply to Fingerprinting text documents for approximate comparison

Hi,
How about doing a diff on the two files and then dividing the size of the diff by the size of the file to give you a percentage of similarity? You might want to prune the diff output so it's only giving the results from the comparison file; otherwise the diff will be a little over twice the expected size, since both sides are listed plus some fluff from diff.
This is more relevant than just comparing the sizes of the files, which is obviously not a solution, in that this way does give consideration to content.
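Something like this rough sketch, assuming diff(1) is on the PATH and both files are plain text; only the '>' lines (content unique to the second file) are counted, per the pruning above:

  #!/usr/bin/perl
  # Rough sketch of the diff-ratio idea. Assumes diff(1) is available and
  # both inputs are plain text; only the '>' lines (content unique to the
  # second file) are counted, so we don't count both sides plus markers.
  use strict;
  use warnings;

  die "usage: $0 oldfile newfile\n" unless @ARGV == 2;
  my ($old, $new) = @ARGV;

  my $diff_size = 0;
  open my $dh, '-|', 'diff', $old, $new or die "diff: $!";
  while (<$dh>) {
      $diff_size += length($1) if /^> (.*)/;   # material only in $new
  }
  close $dh;   # diff exits 1 when the files differ; that's fine here

  my $similarity = 100 * (1 - $diff_size / (-s $new));
  $similarity = 0 if $similarity < 0;          # very different files
  printf "%s vs %s: roughly %.0f%% similar\n", $old, $new, $similarity;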

Regards Paul

Re^2: Fingerprinting text documents for approximate comparison
by Mur (Pilgrim) on Mar 24, 2005 at 21:34 UTC
    Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.

    To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.
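    Something like this, for instance (just a sketch; the four-letter cutoff and MD5 are arbitrary choices):

      #!/usr/bin/perl
      # Sketch of the "fuzzy checksum": throw out short words, whitespace
      # and punctuation, then checksum whatever is left. The four-letter
      # cutoff and the use of MD5 are arbitrary choices here.
      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);

      sub fuzzy_checksum {
          my ($text) = @_;
          my @words = grep { length($_) >= 4 }     # drop the short words
                      map  { lc }
                      $text =~ /([A-Za-z]+)/g;     # letters only, so no punctuation
          return md5_hex(join '', @words);         # and no whitespace
      }

      my $doc = do { local $/; <> };               # slurp the document
      print fuzzy_checksum($doc), "\n";

    Two documents that differ only in spacing, punctuation, or short filler words then hash to the same value, and the checksums can go straight into a hash or database index, so no pairwise comparison is needed.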

    --
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
     http://www.nexcerpt.com
    ...Nexcerpt...Connecting People With Expertise
      understood...
      What about compressing the document with something like Huffman encoding? That would shorten all of the words, replacing repeated instances with keys, so it would really compress the text. You could then compare the 'keys' it uses as replacements to compare like text. Going further (though this might be pushing it), you could just store the keys it uses (i.e. the header from the compression); since those are assigned based on frequency of use, that would also let you eliminate all of the short words.
      Just a thought.. =)
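      Roughly what I'm picturing, leaving out the actual compression and just ranking words by frequency (which is what the header would reflect anyway); the ten-key and four-letter cutoffs are made up:

        #!/usr/bin/perl
        # Sketch of the "compare the keys" idea: skip real Huffman coding
        # and just rank the long words by how often they appear, keeping
        # the top few as the document's key list. Cutoffs are arbitrary.
        use strict;
        use warnings;

        sub key_list {
            my ($text) = @_;
            my %freq;
            $freq{lc $_}++ for grep { length($_) >= 4 } $text =~ /([A-Za-z]+)/g;
            my @keys = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
            my $n = @keys < 10 ? scalar @keys : 10;
            return @keys[0 .. $n - 1];
        }

        # Two documents can then be compared by how many keys they share.
        sub shared_keys {
            my ($list_a, $list_b) = @_;
            my %seen = map { $_ => 1 } @$list_a;
            return scalar grep { $seen{$_} } @$list_b;
        }

      (With ten keys per document, a shared count of, say, eight or more could be treated as a near-duplicate; that threshold is a guess.)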

      Regards Paul

      Update:
      You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around that with this method is to take the header that is generated, pick the top 10 most-used chunks (they can be words or phrases), and sort them alphabetically. Then store your document in directories and subdirectories based off each word, e.g.

      \stars\moon\rocks\
      has RocksoftheMoon.doc and GanymedeGeology.doc
      for instance....
      This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory... or at least nearby.
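      In code the filing step might look something like this (again word frequency stands in for the compression header, and the three-word depth and four-letter cutoff are made up):

        #!/usr/bin/perl
        # Sketch of the filing idea: take the most-used long words, sort
        # them alphabetically, and use them as a directory path so that
        # similar documents end up in (or near) the same place.
        use strict;
        use warnings;
        use File::Path qw(make_path);
        use File::Copy qw(copy);
        use File::Basename qw(basename);

        my ($doc) = @ARGV or die "usage: $0 document\n";
        my $text = do { local (@ARGV, $/) = ($doc); <> };

        my %freq;
        $freq{lc $_}++ for grep { length($_) >= 4 } $text =~ /([A-Za-z]+)/g;
        my @top = (sort { $freq{$b} <=> $freq{$a} } keys %freq)[0 .. 2];
        my $dir = join '/', 'filed', sort grep { defined } @top;

        make_path($dir);
        copy($doc, $dir . '/' . basename($doc)) or die "copy: $!";
        print "filed $doc under $dir/\n";

      (Sorted alphabetically the example above would come out as moon/rocks/stars rather than stars/moon/rocks, but the idea is the same.)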

      Regards Paul