http://qs1969.pair.com?node_id=393412


in reply to I need a comparison/hashing algorithm (not the usual).

This is an interesting problem, and how you tackle it depends on your requirements. The main consideration is whether you just need to do 1 to 1 comparisons, or whether it is a 1 to many or even many to many problem. One to one and one to many can be solved with brute force, so I won't bother discussing them. Many to many gets interesting because comparing each file to every other file is O(n^2), which quickly becomes impractical for a large collection.

Probably the most research on this problem has been done in the context of Web pages and near-duplicate detection, but the logic is applicable to any byte stream. Keywords for your Google search are qw( Broder Rabin Fingerprint Shingle Large Collection ) or some such. Here are a couple of good links:

http://research.microsoft.com/research/sv/PageTurner/similarity.htm
On the resemblance and containment of documents - Andrei Z Broder (PDF)

The basic approach is quite simple. We wish to store the 'essence' of our document/image in such a way that we can compare future documents with it. To do so we make a sketch. We do this as follows:

I am a document containing some logical chunks .....

SHINGLE IT LIKE THIS

I am a document
am a document containing
a document containing some
document containing some logical
containing some logical chunks
.....

HASH THE SHINGLES

1ba00000
f8840000
e19a0000
c6ca0000
82ce0000
.....

SORT THE SHINGLES SUCH THAT .....
WE CAN TAKE A PSEUDORANDOM SELECTION OF THEM, SAY THE FIRST N ~ 20

1ba00000 <-- let's take this first one
82ce0000 <-- and this next one
c6ca0000 <-- and this one, but discard the rest .....
e19a0000
f8840000
.....

So what does this process achieve? Essentially it is a reliable method of selecting random, indicative chunks of a document. These form a sketch of that document and can be stored in a DB that maps hash to doc_id. The bits we select are only pseudorandom, as we will get the same set for identical documents and a similar set for similar documents. Hashing the shingles gives us the dual benefits of condensing the data we store for each shingle and letting us use a simple sort to make a fair random selection.
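For the sake of illustration, here is a minimal Perl sketch of that pipeline (shingle, hash, sort, keep the first N). It assumes word-level shingles of length 5 and uses the core Digest::MD5 module, truncated to 8 hex digits, as a stand-in for a proper Rabin fingerprint:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Build a sketch: shingle the text, hash each shingle, sort the
# hashes and keep the smallest N as the document's fingerprint.
sub make_sketch {
    my ( $text, $shingle_len, $n ) = @_;
    my @words = split /\s+/, $text;
    my %hashes;
    for my $i ( 0 .. $#words - $shingle_len + 1 ) {
        my $shingle = join ' ', @words[ $i .. $i + $shingle_len - 1 ];
        # 8 hex digits are plenty for a demo
        $hashes{ substr( md5_hex($shingle), 0, 8 ) } = 1;
    }
    my @sorted = sort keys %hashes;
    my $last   = $n - 1 < $#sorted ? $n - 1 : $#sorted;
    return @sorted[ 0 .. $last ];
}

my @sketch = make_sketch( 'I am a document containing some logical chunks', 5, 20 );
print "$_\n" for @sketch;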

The mathematics and proofs are outlined in the Broder paper, but in short similar things will have a significant number of these pseudorandomly selected hashes in common. Dissimilar things may share 0, 1, 2, ... of them, but the probability of their sharing a significant number is small. If we have stored the hashes in a DB then each document can be checked against all others in near constant time.
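To make the lookup side concrete, here is a rough sketch that keeps the hash-to-doc_id mapping in an ordinary in-memory Perl hash rather than a real DB; the doc ids and function names are invented for the example:

use strict;
use warnings;

# Map each sketch hash to the documents that contain it, then score
# candidate documents by how many sketch hashes they share with a query.
my %index;    # hash value => { doc_id => 1, ... }

sub index_sketch {
    my ( $doc_id, @sketch ) = @_;
    $index{$_}{$doc_id} = 1 for @sketch;
}

sub find_similar {
    my (@sketch) = @_;
    my %score;    # doc_id => number of shared hashes
    for my $h (@sketch) {
        next unless exists $index{$h};
        $score{$_}++ for keys %{ $index{$h} };
    }
    # highest overlap first; a large overlap implies high resemblance
    return sort { $score{$b} <=> $score{$a} } keys %score;
}

index_sketch( 'doc1', qw(1ba00000 82ce0000 c6ca0000) );
my @matches = find_similar(qw(1ba00000 82ce0000 deadbeef));
print "candidate: $_\n" for @matches;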

Images are in many ways easier, as you can just take N sequential bytes as a chunk. You still need a hashing stage to make sure you get pseudorandom tokens for your sketch. Although in some ways images are easier than a text/html stream, they have problems of their own: a trivial change in scaling, contrast, brightness, gamma etc. will break matching. To fix that you could normalise all images in various ways before applying the shingle algorithm. This should (in theory) allow you to find a match between, say, a thumbnail and the parent image. You could also potentially find images that have been cropped.
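As a rough illustration of the byte-chunk variant (again with Digest::MD5 standing in for a proper fingerprint, and a made-up filename), assuming non-overlapping chunks read from the raw file after whatever normalisation you decide on:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Sketch a binary file: split the data into fixed-size byte chunks,
# hash each chunk, then keep the smallest few hashes as the sketch.
sub sketch_bytes {
    my ( $file, $chunk_len, $n ) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    local $/;                # slurp the whole file
    my $data = <$fh>;
    close $fh;

    my %hashes;
    for ( my $i = 0 ; $i + $chunk_len <= length $data ; $i += $chunk_len ) {
        my $chunk = substr( $data, $i, $chunk_len );
        $hashes{ substr( md5_hex($chunk), 0, 8 ) } = 1;
    }
    my @sorted = sort keys %hashes;
    my $last   = $n - 1 < $#sorted ? $n - 1 : $#sorted;
    return @sorted[ 0 .. $last ];
}

my @sketch = sketch_bytes( 'some_image.jpg', 64, 20 );
print "$_\n" for @sketch;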

You can do all this in Perl, but there are very strong arguments for using C for the guts of it. AFAIK chunks of this algorithm are subject to DEC/AltaVista patent(s).

cheers

tachyon