Re^2: Module for Approximate File Comparison

You mean something like:

#!/usr/bin/perl

use strict;
use warnings;

use IO::Compress::Gzip qw(gzip);

my ( $f1, $f2 ) = @ARGV[ 0 .. 1 ];
my %comparison;

# Get file contents, and compute gzipped lengths.
for my $file ( $f1, $f2 ) {
    get_content( $file, \%{ $comparison{$file} } );
    computations( \%{ $comparison{$file} } );
}

# Try both orderings of files to determine which produces 
# better results. The content is not needed after lengths 
# have been computed.
$comparison{test1}{content} =
  $comparison{$f1}{content} . $comparison{$f2}{content};
computations( \%{ $comparison{test1} } );
delete $comparison{test1}{content};
$comparison{test2}{content} =
  $comparison{$f2}{content} . $comparison{$f1}{content};
computations( \%{ $comparison{test2} } );
delete $comparison{test2}{content};

# The original content is no longer needed.
foreach my $file ( $f1, $f2 ) {
    delete $comparison{$file}{content};
}

printf qq{Comparison: %5.2f%% similarity\n},
  compute_ratio( \%comparison, ( $f1, $f2 ) );

sub order_pair {
    my ( $i, $j ) = @_;
    my ( $min, $max ) = ( $i, $j );
    ( $min, $max ) = ( $max, $min ) if ( $max < $min );
    return ( $min, $max );
}

sub compute_ratio {
    my ( $hash, @fl ) = @_;

    my ( $min, $min_result, $max, $ratio );

    ( $min_result, undef ) = order_pair(
        $hash->{test1}{gzip_length},
        $hash->{test2}{gzip_length}
    );
    ( $min, $max ) = order_pair(
        $hash->{ $fl[0] }{gzip_length},
        $hash->{ $fl[1] }{gzip_length}
    );

    # Files are 100% if they match exactly. 
    #   (a + b - a) / b = b / b = 1
    # Files are 0% if they do not match exactly.
    #   (a + b - (a+b)) / b = 0 / b = 0
    # Ratio computed as how close to the minimal size

    $ratio = 100.0 * ( $min + $max - $min_result ) / $max;
    return $ratio;
}

sub computations {
    my ($hash) = @_;

    # Compute gzipped length of content
    $hash->{length} = length $hash->{content};
    gzip \$hash->{content}, \$hash->{compressed};
    # Gzipped content not needed 
    #   after length has been computed.
    $hash->{gzip_length} = length $hash->{compressed};
    delete $hash->{compressed};
}

sub get_content {
    my ( $fn, $hash ) = @_;

    # Slurp in file content in bin mode
    open( DF, $fn ) or die $!;
    binmode DF;
    $/ = undef;
    $hash->{content} = <DF>;
    close DF;
}
[download]

The code above uses IO::Compress::Gzip, and takes the two files as arguments on the command line. The ratio in this case is computed by saying that 100% would be if the content of the two files compressed to be the same size as one of them, 0% as compressing to the sum of the sizes of the two files.

(By the way, for the example files from my earlier post, this gave a result of 90.61%, which is a fair approximation.)

Update: 2007-09-24
Added comments to code; had reversed description of ratio.

Comment on Re^2: Module for Approximate File Comparison Download Code

Replies are listed 'Best First'.
Re^3: Module for Approximate File Comparison by salva (Canon) on Sep 24, 2007 at 19:46 UTC
If I recall correctly, zlib (and so, gzip) operate over a data window of 64Kb or less, so your method would not work if the files are bigger than that.	[reply]
Re^4: Module for Approximate File Comparison by atcroft (Abbot) on Sep 24, 2007 at 23:46 UTC
True-any compression method that uses a data window in a similar manner would experience the same issue, albeit the size at which it would occur would be dependent upon the size of said window. The code above was merely an example.	[reply]