kennedy has asked for the wisdom of the Perl Monks concerning the following question:
Re: Similarity measurement
by hdb (Monsignor) on Apr 18, 2015 at 14:02 UTC
It really depends on the objective. You could count the occurrences of each word and compare the frequencies, but the frequencies could be identical even though the text files are completely different. A single 'not' could change the meaning of a text into its opposite. The 'diff' tool is useful for comparing text files, but it works mainly on lines. You could treat each text file as a sequence of words and then apply a longest common subsequence algorithm, as in Algorithm::Diff.
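Not the OP's code; just a minimal sketch of that word-sequence/LCS idea, assuming Algorithm::Diff is installed and that splitting on whitespace is an acceptable notion of "word":

#!/usr/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(LCS);

# Read a file as a flat list of whitespace-separated words.
sub words_of {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    return split ' ', <$fh>;
}

my @a = words_of( $ARGV[0] );
my @b = words_of( $ARGV[1] );

# Longest common subsequence of the two word lists.
my @lcs = LCS( \@a, \@b );

# One possible score: shared subsequence length relative to the longer file.
my $longer = @a > @b ? @a : @b;
my $score  = $longer ? @lcs / $longer : 1;
printf "similarity: %.3f\n", $score;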
Re: Similarity measurement
by ww (Archbishop) on Apr 19, 2015 at 03:34 UTC
As hdb told you, "(i)t really depends...." Here are just a few more ways that "it...depends." Despite the wisdom offered by others (above), I suspect this is do-able.

One might tackle this problem by using sorted arrays of the words in each of the two files and testing for matches by position (which will require the test to skip over every instance after the first of any word that appears multiple times in one file or the other, or that appears multiple times in both but a different number of times).

Another possibility worth exploring would be to use hashes to count the instances of each word in each file (and perhaps cast those into a second set of sorted arrays where each element holds a word and its count, i.e. the key/value pairs from the hash). One could then use a regex to compare, by position, the word (key) elements in that second set of sorted arrays and decide -- accounting for case, count, both, or neither -- whether to accept a pair as "exactly the same." The arithmetic for turning that into a similarity percentage is left as an exercise for the OP [ :-) ] ... or for someone with better brains or more free time than I have at the moment. A sketch of this hash-counting idea follows the perlfaq excerpt below.

UPDATE: 0740 EDT 20150419: Found in C:\Perl\lib\pods\perlfaq4.pod
How do I test whether two arrays or hashes are equal?
With Perl 5.10 and later, the smart match operator can give you the
answer with the least amount of work:
use 5.010;
if( @array1 ~~ @array2 ) {
    say "The arrays are the same";
}

if( %hash1 ~~ %hash2 ) { # doesn't check values!  <- !!!
    say "The hash keys are the same";
}
....
Sometimes, a fresh dawn and fresh coffee are helpful in finding the obvious. Update2 1230 EDT 20150421: See File Similarity Concept for a proof of concept.
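Not the proof of concept linked above, and only a guess at the details; a minimal sketch of the hash-counting idea described earlier, assuming case-folding and whitespace word-splitting are acceptable:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Count word occurrences in one file, case-folded.
sub word_counts {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %count;
    while (<$fh>) {
        $count{ lc $_ }++ for split ' ';
    }
    return \%count;
}

my $c1 = word_counts( $ARGV[0] );
my $c2 = word_counts( $ARGV[1] );

# Words whose counts agree, over the union of words seen in either file.
my %union = map { $_ => 1 } keys %$c1, keys %$c2;
my $same  = grep { ($c1->{$_} // 0) == ($c2->{$_} // 0) } keys %union;
my $total = keys %union;
my $pct   = $total ? 100 * $same / $total : 100;
printf "%.1f%% of distinct words appear exactly the same number of times\n", $pct;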
Re: Similarity measurement
by Khen1950fx (Canon) on Apr 18, 2015 at 22:33 UTC
It'll take a few minutes, but it comes back with a score. In this case, the result was 0.999615754082613 for two files that were exactly the same; for two completely different files, the similarity score came back at 0.345969033635878.
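The script that produced those numbers isn't reproduced above. One common way to get a score in that 0-to-1 range is cosine similarity over word-frequency vectors; the sketch below is only an assumption about the approach, not the poster's code, and will not reproduce the quoted figures:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Build a word-frequency vector (hash) for one file.
sub vector_of {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %v;
    while (<$fh>) {
        $v{ lc $_ }++ for split ' ';
    }
    return \%v;
}

# Cosine similarity: dot product over the product of the vector lengths.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
    $nx  += $_**2 for values %$x;
    $ny  += $_**2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

my $v1 = vector_of( $ARGV[0] );
my $v2 = vector_of( $ARGV[1] );
printf "similarity: %.3f\n", cosine( $v1, $v2 );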
Re: Similarity measurement
by Marshall (Canon) on Apr 18, 2015 at 23:48 UTC
One technique is Text::Levenshtein, which calculates the Levenshtein edit distance between two strings. This is not easy even for a human to do. There are so many syntactic differences between languages that it is almost impossible for a computer to do.
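A minimal sketch of that module in use, assuming the files are small enough that an edit-distance comparison over the whole contents is tolerable (it gets expensive quickly on long texts):

#!/usr/bin/perl
use strict;
use warnings;
use Text::Levenshtein qw(distance);

# Slurp both files whole.
my @texts = map {
    open my $fh, '<', $_ or die "Cannot open $_: $!";
    local $/;
    scalar <$fh>;
} @ARGV[0, 1];

# Edit distance, then a crude similarity: 1 - distance / length of the longer text.
my $dist   = distance( $texts[0], $texts[1] );
my $longer = length( $texts[0] ) > length( $texts[1] )
           ? length( $texts[0] )
           : length( $texts[1] );
printf "edit distance: %d, similarity: %.3f\n",
    $dist, $longer ? 1 - $dist / $longer : 1;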
Re: Similarity measurement
by FreeBeerReekingMonk (Deacon) on Apr 19, 2015 at 06:16 UTC
comm comes to mind, but this is a perl forum, so here is my shot at it, although it is flawed: permuted lines do not get registered as a difference, nor do extra repeated lines:
Not sure who to give credit to... here is the source: http://www.cyberciti.biz/faq/command-to-display-lines-common-in-files/
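The posted code isn't reproduced above; as a rough sketch of the same idea (hash the lines of one file, then count how many lines of the other also appear), with the caveats about ordering and repeats already noted:

#!/usr/bin/perl
use strict;
use warnings;

my ($file1, $file2) = @ARGV;

# Remember every line seen in the first file.
my %seen;
open my $fh1, '<', $file1 or die "Cannot open $file1: $!";
$seen{$_} = 1 while <$fh1>;

# Count how many lines of the second file also appear in the first.
my ($common, $total) = (0, 0);
open my $fh2, '<', $file2 or die "Cannot open $file2: $!";
while (<$fh2>) {
    $total++;
    $common++ if $seen{$_};
}

printf "%d of %d lines in %s also occur in %s (%.1f%%)\n",
    $common, $total, $file2, $file1,
    $total ? 100 * $common / $total : 0;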
by Happy-the-monk (Canon) on Apr 19, 2015 at 06:42 UTC
Credit goes to mu, the source says. Cheers, Sören. Creator of mobile bugs - let loose once, run everywhere.
Re: Similarity measurement
by oiskuu (Hermit) on Apr 20, 2015 at 05:39 UTC
The question is vague as to the intent and scope of the problem. The best I could surmise is that this might relate to plagiarism detection. How literally do you mean "based on word content"? One rudimentary approach is to compress the text units first separately, then together (as a "solid archive"), for an estimate of entropy.
Just something to toy with. There might be a better way to get at the compressed size.
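The script this refers to isn't shown above; a minimal sketch of the compress-separately-versus-together idea, assuming Compress::Zlib is available (a crude normalized compression distance):

#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Slurp a file.
sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    return scalar <$fh>;
}

my $t1 = slurp( $ARGV[0] );
my $t2 = slurp( $ARGV[1] );

# Compressed sizes: each text alone, and both concatenated.
my $c1  = length compress($t1);
my $c2  = length compress($t2);
my $c12 = length compress($t1 . $t2);

# Normalized compression distance: near 0 for near-identical texts, near 1 for unrelated ones.
my ($min, $max) = $c1 < $c2 ? ($c1, $c2) : ($c2, $c1);
printf "NCD: %.3f\n", ($c12 - $min) / $max;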