Give Text::Similarity::Overlaps a try. For example:
#!/usr/bin/perl -l
use strict;
use warnings;
use Text::Similarity::Overlaps;

# Ask for a normalized score and verbose output.
my %opt = (
    verbose   => 1,
    normalize => 1,
);
my $mod = Text::Similarity::Overlaps->new( \%opt );
die "Construction of Text::Similarity::Overlaps failed" unless defined $mod;

# getSimilarity() takes file names and reads the files itself,
# so there is no need to open or close them here.
my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
my $file2 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";

my $score = $mod->getSimilarity( $file1, $file2 );
print "The similarity of $file1 and $file2 is: $score";
It'll take a few minutes, but it comes back with a score. In this case, the result was 0.999615754082613 for two identical files.
For two completely different files:
#!/usr/bin/perl -l
use strict;
use warnings;
use Text::Similarity::Overlaps;

# Same options as before: normalized score, verbose output.
my %opt = (
    verbose   => 1,
    normalize => 1,
);
my $mod = Text::Similarity::Overlaps->new( \%opt );
die "Construction of Text::Similarity::Overlaps failed" unless defined $mod;

# Two unrelated modules this time; pass the file names directly.
my $file1 = "/usr/lib/perl5/5.10.0/i386-linux-thread-multi/Encode.pm";
my $file2 = "/usr/local/lib/perl5/site_perl/5.10.0/POE.pm";

my $score = $mod->getSimilarity( $file1, $file2 );
print "The similarity of the two files is: $score";
The similarity score for two completely different files came back at:
0.345969033635878
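A score of about 0.35 for unrelated files may look high. One probable reason is that both files are Perl source and share a lot of common words. The sketch below is only a simplified illustration of that effect, not the algorithm Text::Similarity::Overlaps uses: it counts shared words and ignores word order. The names word_counts and overlap_score are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified illustration only: a bag-of-words overlap score.
# This is NOT the scoring method of Text::Similarity::Overlaps.

# Split text into lowercase words and count each one.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Score between 0 and 1: 2 * shared / (words in a + words in b).
sub overlap_score {
    my ( $text_a, $text_b ) = @_;
    my ( $ca, $cb ) = ( word_counts($text_a), word_counts($text_b) );

    my ( $total_a, $total_b, $shared ) = ( 0, 0, 0 );
    $total_a += $_ for values %$ca;
    $total_b += $_ for values %$cb;
    return 0 unless $total_a + $total_b;

    # A word is shared as many times as it appears in both texts.
    for my $word ( keys %$ca ) {
        next unless exists $cb->{$word};
        $shared += $ca->{$word} < $cb->{$word} ? $ca->{$word} : $cb->{$word};
    }
    return 2 * $shared / ( $total_a + $total_b );
}

# Identical snippets share every word.
print overlap_score( "use strict; use warnings;", "use strict; use warnings;" ), "\n";    # prints 1

# Different snippets still share "use" and "strict".
print overlap_score( "use strict; use warnings;", "use strict; print 1;" ), "\n";         # prints 0.5
```

Even two short snippets that do different things score 0.5 here, only because they share boilerplate words. So treat the module's score as relative: compare it against scores for other file pairs rather than reading it as an absolute percentage.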