Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.
To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.
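For what it's worth, here is a rough Python sketch of that "fuzzy checksum" idea, not a tested production approach. The names `fuzzy_checksum` and `group_near_duplicates`, the five-character cutoff for "short" words, the lowercasing step, and the choice of SHA-256 are all assumptions made for illustration. Grouping documents by checksum in a dictionary takes one pass over the documents, which sidesteps the N-squared pairwise comparison.

```python
import hashlib
import re
from collections import defaultdict

# Assumed cutoff: words shorter than this are treated as "short" and dropped.
MIN_WORD_LEN = 5


def fuzzy_checksum(text, min_word_len=MIN_WORD_LEN):
    """Checksum a document after discarding short words, whitespace and punctuation."""
    # Lowercasing is an extra assumption; it makes the checksum case-insensitive.
    words = re.findall(r"[a-z0-9]+", text.lower())
    kept = [w for w in words if len(w) >= min_word_len]
    # Concatenate the surviving words and hash the resulting string.
    return hashlib.sha256("".join(kept).encode("utf-8")).hexdigest()


def group_near_duplicates(docs):
    """Bucket document ids by fuzzy checksum; return buckets with more than one id."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fuzzy_checksum(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    docs = {
        "a": "The quick brown fox jumps over the lazy dog.",
        "b": "the  quick, brown fox   jumps over a lazy dog!",
        "c": "Completely different document about databases",
    }
    print(group_near_duplicates(docs))
```

Note the limitation: this still gives a zero-or-one answer, just on a normalized string. Two documents that differ by a single long word will get unrelated checksums, so it catches reformatting and small-word edits but not larger rewrites.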
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
...Nexcerpt...Connecting People With Expertise