Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

This is an interesting problem, and how you tackle it depends on your requirements. The main consideration is do you just need to do 1 to 1 comparisons or is it a 1 to many or even many to many problem. The one to one, or one to many can be solved with brute force so I won't bother discussing them. Many to many gets interesting due to the impossibility of comparing each file to every other file which is O(n!).

Probably the most research on this problem has been done in the context of Web pages and near duplicate pages, the logic is however applicable to any byte stream. Keywords for your Google search are qw( Broder Rabin Fingerprint Shingle Large Collection ) or some such. Here are a couple of good links:
On the resemblance and containment of documents - Andrei Z Broder (PDF)

The basic approach is quite simple. We wish to store the 'essence' of our document/image in such a way that we can compare future documents with it. To do so we make a sketch. We do this as follows:

I am a document containing some logical chunks ..... SHINGLE IT LIKE THIS I am a document am a document containing a document containing some document containing some logical containing some logical chunks ..... HASH THE SHINGES 1ba00000 f8840000 e19a0000 c6ca0000 82ce0000 ..... SORT THE SHINGLES SUCH THAT ..... WE CAN TAKE A PSEUDORANDOM SELECTION OF THEM, SAY THE FIRST N ~ +20 1ba00000 <-- lets take this first one 82ce0000 <-- and this next one c6ca0000 <-- and this one, but discard rest..... e19a0000 f8840000 .....

So what does this process achieve. Essentially it is a reliable method to select random indicative chunks of a document. These form a sketch of that document and can be stored in a DB that maps hash to doc_id. The bits we select are only pseuodrandom as we will get the same set for identical documents and a similar set for similar documents. Hashing the shingles gives us the dual benefits of condensing the data storage for each shingle and allowing us to use sort to make a fair random selection.

The mathematics and proofs are outlined in the Broder paper but in short similar things will have a significant number of these pseudorandomly selected hashes in common. Dissimilar things may have 0,1, 2, ... but the probability of them having a significant number is small. If we have stored the hashes in a DB then each document can be checked against all others in near constant time.

Images are in many ways easier as you can just take N sequential bytes as a chunk. You still need to use a hashing stage to make sure you get pseudorandom tokens for your sketch. Although in some ways images are easier than a text/html stream there are other problems unique to them. To fix issues with a trivial change in scaling/contrast/brightness/gamma etc breaking matching you could normalise all images in various ways before applying the shingle algorithm. This should (in theory) allow you to find a match between say a thumbnail and the parent image. You could also potentially find images that have been cropped as well.

You can do all this in Perl but there are very strong arguments for using C for the guts of it. AFAIK chunks of this algorithm are subject to DEC/Altavista patent(s).



In reply to Re: I need a comparison/hashing algorithm (not the usual). by tachyon
in thread I need a comparison/hashing algorithm (not the usual). by Cap'n Steve

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (2)
As of 2023-03-31 06:16 GMT
Find Nodes?
    Voting Booth?
    Which type of climate do you prefer to live in?

    Results (75 votes). Check out past polls.