http://qs1969.pair.com?node_id=393397

Cap'n Steve has asked for the wisdom of the Perl Monks concerning the following question:

What I'm looking for is something to compare two files, specifically images. Most hashing algorithms are designed to generate vastly different hashes from only slight differences in the source files, but I need the exact opposite: something that will tell me how similar two files are.

Does anything like this already exist? Could someone point me to a good resource for finding out about this sort of thing?

Re: I need a comparison/hashing algorithm (not the usual).
by Zaxo (Archbishop) on Sep 24, 2004 at 06:13 UTC

    This is crude, but you can try zipping them separately for comparison and zipping their concatenation. If they are similar, you'll get a higher compression ratio than if not.
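
    For illustration, a minimal sketch of that trick with Compress::Zlib (slurp() and zip_similarity() are made-up names for this example): if the concatenation compresses much better than the two files do separately, the files share a lot of structure.

    use strict;
    use warnings;
    use Compress::Zlib;    # exports compress()

    # Read a file as one raw byte string.
    sub slurp {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        local $/;
        return <$fh>;
    }

    # Rough similarity score: how much smaller the zipped concatenation is
    # than the sum of the separately zipped files. Nearer 1 means more similar.
    sub zip_similarity {
        my ( $s1, $s2 ) = map { slurp( $_ ) } @_;
        my $z1  = length compress( $s1 );
        my $z2  = length compress( $s2 );
        my $z12 = length compress( $s1 . $s2 );
        return ( $z1 + $z2 - $z12 ) / ( $z1 < $z2 ? $z1 : $z2 );
    }

    printf "similarity: %.3f\n", zip_similarity( @ARGV[ 0, 1 ] );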

    Another possibility is to apply a fast Fourier transform to each and multiply them pointwise. That gives the Fourier transform of their correlation function. This has the potential to give very precise results, but may be hard to interpret. Similar files will produce a strongly peaked correlation function.

    If you want to try the second route, PDL is the way.
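
    For the curious, a rough sketch of that route with PDL::FFT, treating each file as a zero-padded stream of byte values (the bytes_pdl() helper and the peak/mean score are illustrative choices, not a standard recipe):

    use strict;
    use warnings;
    use PDL;
    use PDL::FFT;

    # Load a file as a 1-D piddle of byte values, zero-padded to $len.
    sub bytes_pdl {
        my ( $path, $len ) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my $data  = do { local $/; <$fh> };
        my @bytes = unpack 'C*', $data;
        push @bytes, ( 0 ) x ( $len - @bytes );
        return pdl( @bytes );
    }

    my ( $f1, $f2 ) = @ARGV;
    my $len = ( -s $f1 > -s $f2 ) ? -s $f1 : -s $f2;

    my ( $ar, $br ) = map { bytes_pdl( $_, $len ) } $f1, $f2;
    my ( $ai, $bi ) = ( zeroes( $len ), zeroes( $len ) );

    # Forward transforms; PDL::FFT's fft() works in place on real/imag parts.
    fft( $ar, $ai );
    fft( $br, $bi );

    # Pointwise multiply A by the conjugate of B. The inverse transform of
    # this product is the cross-correlation of the two byte streams.
    my $cr = $ar * $br + $ai * $bi;
    my $ci = $ai * $br - $ar * $bi;
    ifft( $cr, $ci );

    # A sharp, dominant peak suggests the streams are strongly correlated.
    printf "peak/mean correlation ratio: %.2f\n", $cr->max / $cr->abs->avg;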

    After Compline,
    Zaxo

Re: I need a comparison/hashing algorithm (not the usual).
by tachyon (Chancellor) on Sep 24, 2004 at 07:49 UTC

    This is an interesting problem, and how you tackle it depends on your requirements. The main consideration is whether you just need to do 1-to-1 comparisons, or whether it is a 1-to-many or even many-to-many problem. The one-to-one and one-to-many cases can be solved with brute force, so I won't bother discussing them. Many-to-many gets interesting because comparing each file to every other file is O(n^2), which quickly becomes impractical.

    Probably the most research on this problem has been done in the context of Web pages and near-duplicate detection, but the logic is applicable to any byte stream. Keywords for your Google search are qw( Broder Rabin Fingerprint Shingle Large Collection ) or some such. Here are a couple of good links:

    http://research.microsoft.com/research/sv/PageTurner/similarity.htm
    On the resemblance and containment of documents - Andrei Z Broder (PDF)

    The basic approach is quite simple. We wish to store the 'essence' of our document/image in such a way that we can compare future documents with it. To do so we make a sketch. We do this as follows:

    I am a document containing some logical chunks .....

    SHINGLE IT LIKE THIS

        I am a document
        am a document containing
        a document containing some
        document containing some logical
        containing some logical chunks
        .....

    HASH THE SHINGLES

        1ba00000
        f8840000
        e19a0000
        c6ca0000
        82ce0000
        .....

    SORT THE SHINGLES SUCH THAT WE CAN TAKE A PSEUDORANDOM
    SELECTION OF THEM, SAY THE FIRST N ~ 20

        1ba00000  <-- let's take this first one
        82ce0000  <-- and this next one
        c6ca0000  <-- and this one, but discard the rest .....
        e19a0000
        f8840000
        .....

    So what does this process achieve? Essentially it is a reliable method of selecting random, indicative chunks of a document. These form a sketch of that document and can be stored in a DB that maps hash to doc_id. The chunks we select are only pseudorandom, as we will get the same set for identical documents and a similar set for similar documents. Hashing the shingles gives us the dual benefits of condensing the storage needed for each shingle and letting us use a sort to make a fair random selection.

    The mathematics and proofs are outlined in the Broder paper, but in short, similar things will have a significant number of these pseudorandomly selected hashes in common. Dissimilar things may have 0, 1, 2, ... in common, but the probability of them having a significant number is small. If we have stored the hashes in a DB, then each document can be checked against all the others in near constant time.

    Images are in many ways easier, as you can just take N sequential bytes as a chunk. You still need a hashing stage to make sure you get pseudorandom tokens for your sketch. Although in some ways images are easier than a text/html stream, they have problems of their own: a trivial change in scaling/contrast/brightness/gamma etc. will break matching, so you could normalise all images in various ways before applying the shingle algorithm. This should (in theory) allow you to find a match between, say, a thumbnail and the parent image. You could also potentially find images that have been cropped.
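
    For what it's worth, here is a toy sketch of the byte-shingle version (the 8-byte shingle size, the 20-hash sketch size, and the use of MD5 are arbitrary choices for illustration, not what the Broder paper prescribes):

    use strict;
    use warnings;
    use Digest::MD5 qw( md5_hex );

    my $SHINGLE = 8;    # bytes per shingle (illustrative)
    my $KEEP    = 20;   # hashes kept per sketch (illustrative)

    # Build a sketch: hash every overlapping $SHINGLE-byte window, sort the
    # hashes, and keep the first $KEEP, a pseudorandom but repeatable sample.
    sub sketch {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my $data = do { local $/; <$fh> };
        my %seen;
        $seen{ md5_hex( substr $data, $_, $SHINGLE ) } = 1
            for 0 .. length( $data ) - $SHINGLE;
        my @sorted = sort keys %seen;
        my $n = $KEEP < @sorted ? $KEEP : scalar @sorted;
        return @sorted[ 0 .. $n - 1 ];
    }

    # Resemblance estimate: fraction of sketch hashes the two files share.
    sub resemblance {
        my ( $file1, $file2 ) = @_;
        my %in_first = map { $_ => 1 } sketch( $file1 );
        my @second   = sketch( $file2 );
        my $common   = grep { $in_first{ $_ } } @second;
        return @second ? $common / @second : 0;
    }

    printf "resemblance: %.2f\n", resemblance( @ARGV[ 0, 1 ] );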

    You can do all this in Perl but there are very strong arguments for using C for the guts of it. AFAIK chunks of this algorithm are subject to DEC/Altavista patent(s).

    cheers

    tachyon

Re: I need a comparison/hashing algorithm (not the usual).
by TedPride (Priest) on Sep 24, 2004 at 06:12 UTC
    Hmm, this could be very easy or very difficult depending on how good the "enemy" algorithm is. What image format is being used? Are random bits being turned on and off, or are lines being drawn across the whole image? Is there fluff material at the edges or corners? Etc. The human brain is far better at pattern matching than any algorithm ever will be.
Re: I need a comparison/hashing algorithm (not the usual).
by cheshirecat (Sexton) on Sep 24, 2004 at 16:58 UTC
    Another theoretical approach would be to compress both images into JPEGs and then parse each JPEG file for its 8x8 chunks.

    Calculate a hash for each of the chunks and then you could compare the list of hashes for each image.

    Alternatively you could break each image up into 8x8 chunks and then "naively" quantise them (take them down to 12-bit or even 9-bit colour depth).
    Then calculate the average and median of each quantised chunk and compare the lists of values for each image.
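
    A rough sketch of that second approach with GD (the 9-bit quantisation and the mean absolute difference comparison are just example choices, and it assumes both images have the same dimensions):

    use strict;
    use warnings;
    use GD;
    use List::Util qw( sum );

    # Average of the 9-bit (3 bits per channel) quantised pixels in one
    # 8x8 block whose top-left corner is ($bx, $by).
    sub block_signature {
        my ( $img, $bx, $by ) = @_;
        my @q;
        for my $y ( $by .. $by + 7 ) {
            for my $x ( $bx .. $bx + 7 ) {
                my ( $r, $g, $b ) = $img->rgb( $img->getPixel( $x, $y ) );
                push @q, ( ( $r >> 5 ) << 6 ) | ( ( $g >> 5 ) << 3 ) | ( $b >> 5 );
            }
        }
        return sum( @q ) / @q;
    }

    # List of block averages for a whole image, left to right, top to bottom.
    sub signature {
        my ( $path ) = @_;
        my $img = GD::Image->new( $path ) or die "can't read $path";
        my ( $w, $h ) = $img->getBounds;
        my @sig;
        for ( my $by = 0; $by + 8 <= $h; $by += 8 ) {
            for ( my $bx = 0; $bx + 8 <= $w; $bx += 8 ) {
                push @sig, block_signature( $img, $bx, $by );
            }
        }
        return @sig;
    }

    # Crude comparison: mean absolute difference of the two block lists.
    my @sig_a = signature( $ARGV[0] );
    my @sig_b = signature( $ARGV[1] );
    my $n = @sig_a < @sig_b ? @sig_a : @sig_b;
    printf "mean block difference: %.2f\n",
        sum( map { abs( $sig_a[$_] - $sig_b[$_] ) } 0 .. $n - 1 ) / $n;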

    Cheers The Cat

      Wow, there are definitely some good ideas here, thanks guys. This is all theoretical at the moment, so I'm just investigating the possibilities, but I'll be sure to read up on some of the suggestions.
Re: I need a comparison/hashing algorithm (not the usual).
by Anonymous Monk on Sep 24, 2004 at 12:22 UTC
    If you were using text, you could try Digest::Nilsimsa. Small changes in a text file only change the digest slightly, so similar files get similar digests.
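
    A minimal sketch of how that might look; this assumes Digest::Nilsimsa's text2digest() interface and scores similarity by counting how many bits of the two 256-bit digests differ (fewer differing bits means more similar):

    use strict;
    use warnings;
    use Digest::Nilsimsa;    # locality-sensitive digest from CPAN

    sub slurp {
        my ( $path ) = @_;
        open my $fh, '<', $path or die "open $path: $!";
        local $/;
        return <$fh>;
    }

    my $nilsimsa = Digest::Nilsimsa->new;

    # text2digest() is assumed here to return the digest as a hex string.
    my ( $d1, $d2 ) = map { $nilsimsa->text2digest( slurp( $_ ) ) } @ARGV[ 0, 1 ];

    # XOR the raw digest bytes and count the bits that differ (0 = identical).
    my $differing = unpack '%32b*', ( pack( 'H*', $d1 ) ^ pack( 'H*', $d2 ) );
    printf "differing bits: %d of 256\n", $differing;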
Re: I need a comparison/hashing algorithm (not the usual).
by BrowserUk (Patriarch) on Sep 24, 2004 at 19:43 UTC

    This is fairly simplistic and quite slow as is, but it does a very good job of matching inverted, flipped and rotated images; pretty good on resized images; and it seems fairly effective on similar images in different formats.

    It's currently not good at cropped, negative, or greyscaled images, though 24-bit <-> 8-bit colour transforms in either direction seem to match quite well.

    The performance can be improved markedly by using PDL or similar.

    #! perl -slw
    use strict;
    use Data::Dumper;
    use List::Util qw[ reduce sum max min ];
    use GD;

    # Reduce an image to a single score: build a histogram of 24-bit RGB
    # pixel values, normalise both the values and their frequencies, and
    # return the frequency-weighted sum.
    sub check {
        my $img = GD::Image->new( shift() );
        my( $iMin, $iMax, %freq ) = ( 0, 0 );
        my( $xMax, $yMax ) = $img->getBounds;
        for my $y ( 0 .. $yMax - 1 ) {
            for my $x ( 0 .. $xMax - 1 ) {
                # Pack the (r, g, b) triple into one 24-bit integer.
                my $i = reduce { ( $a << 8 ) + $b } $img->rgb( $img->getPixel( $x, $y ) );
                $freq{ $i }++;
                $iMin = $iMin < $i ? $iMin : $i;
                $iMax = $iMax > $i ? $iMax : $i;
            }
        }
        my $count = $xMax * $yMax;
        my $ex    = 16777216 / ( $iMax - $iMin );
        my %norm;
        while( my( $i, $f ) = each %freq ) {
            $norm{ ( ( $i - $iMin ) * $ex ) / 16777216 } = $f / $count;
        }
        return 100 * sum map { $_ * $norm{ $_ } } keys %norm;
    }

    # Score every file named on the command line and print them sorted by
    # score, so similar images end up next to each other.
    my %sums = map { check( $_ ), $_ } map glob, @ARGV;

    printf "%32s : %7.5f\n", $sums{ $_ }, $_ for sort { $a <=> $b } keys %sums;

    __END__

    I'd be interested to see pairs of disparate images that produce very close checksums--assuming you are comfortable sharing them:)

    This isn't based upon any other algorithms that I am aware of--just made up as I went along--but if anyone sees a relationship between this and prior art, I'd appreciate references.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: I need a comparison/hashing algorithm (not the usual).
by Boots111 (Hermit) on Sep 25, 2004 at 14:45 UTC
    Cap'n Steve~

    Have you considered using a binary equivalent of the diff algorithm? I know that one exists and can be implemented very efficiently. It will give you an idea not only of which parts of the image are identical, but also of which parts differ. You could modify the part of the algorithm that checks for "identical"-ness and use your own custom similarity function, which would allow you to control the results a bit more.

    You might want to perform the diff in the pixel space of the images rather than the binary space of the encoding, but this should also be a relatively painless change...
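
    As a rough Perl sketch of that idea, Algorithm::Diff's LCS() can stand in for a binary diff if you treat each file as a sequence of fixed-size chunks (the 16-byte chunk size here is an arbitrary choice):

    use strict;
    use warnings;
    use Algorithm::Diff qw( LCS );

    # Chunk size is illustrative: diff the files as sequences of 16-byte
    # blocks rather than single bytes to keep the LCS computation tractable.
    my $CHUNK = 16;

    sub chunks {
        my ( $path ) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my $data = do { local $/; <$fh> };
        return [ unpack "(a$CHUNK)*", $data ];
    }

    my ( $chunks1, $chunks2 ) = map { chunks( $_ ) } @ARGV[ 0, 1 ];

    # Longest common subsequence of chunks: the longer it is, the more of
    # the two byte streams line up in the same order.
    my @common = LCS( $chunks1, $chunks2 );

    printf "similarity: %.2f\n", 2 * @common / ( @$chunks1 + @$chunks2 );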

    Boots
    ---
    Perhaps one did not want to be loved so much as to be understood.
    -1984
Re: I need a comparison/hashing algorithm (not the usual).
by parv (Parson) on Sep 27, 2004 at 20:06 UTC