File Similarity Concept

Proof of concept -- sparked by the discussion in Similarity measurement. NB that this is NOT PRODUCTION GRADE CODE nor does it resolve all issues with the loose (aka 'incomplete') spec in the OP)

#! /usr/bin/perl -w
use 5.018;

# 1123881a.pl (cf  [id=1123881])
# $file1 is used in place of OP's text file1; DATA stands in for text 
+file 2
# perl 5, version 18, subversion 4 (v5.18.4) built for MSWin32-x86-mul
+ti-thread-64int
# ...
# Binary build 1804 [298913] provided by ActiveState http://www.Active
+State.com
# Built Mar 19 2015 17:49:00

my (@F1, %F1, @F2, %F2);

my $file1 = "Now is the time for the quick red fox to jump over the la
+zy brown dog's spooon while all good men run away with the fork and c
+ome to the aid of their country";   
chomp $file1;
@F1 = split / /, $file1;              # individual words

my $file2 = <DATA>; 
chomp $file2;
@F2 = split / /, $file2; 

$F1{$_}++ for @F1;                   # produces hash with key::value p
+airs word => count for each word
$F2{$_}++ for @F2;

say "\n\t --- Ln 25 Printing keys and values for the HASHES, \%F1 and 
+\%F2\n\t\t ...and creating ARRAYS \@F1combined and \@F2combined.";
my (@F1combined, @F2combined);     # 
while ( my ($key, $value) = each(%F1) ) {
        print "$key => $value\n";
        push @F1combined, ($key . ' => ' . $value);
}
say "\n\t --- \%F2, next: ---";
while ( my ($key, $value) = each(%F2) ) {
        print "$key => $value\n";
        push @F2combined, ($key . ' => ' . $value);
}

my @sort_arr1 = sort {fc($a) cmp fc($b)} @F1combined;    # fc to norma
+lize
my @sort_arr2 = sort {fc($a) cmp fc($b)} @F2combined;


my $entry;                      # a complete element of an array, @sor
+t_arr1 in this case. See Ln 54
my $counter = qr/ => \d+/;      # the part of of the element we'll exc
+lude in Ln 22-23 (so can match words w/variant counts)
my $word;                       # search term for the word only, less 
+the fat arrow and counter; see Ln 22-23
my $match_count = 0;
my $mismatch=0;
my $len1 = $#sort_arr1;         # used to determine the terminal state
+ of the loop at Ln 50
my $len2 = $#sort_arr2;

my $item_count = ($len1 > $len2) ? ($len1+1) : ($len2+1);   # Longer o
+f the two arrays (files) ... 
say "\t\t \$item_count: $item_count";                       # which ca
+uses "uninit" warnings at Ln 55 et seq.

my $i;
for ( $i=0; $i<($item_count); $i++) {
    my $entry = $sort_arr1[$i];
    chomp $entry;
    say ">> Ln 56    \$i: $i   |$entry| ";   # can be used for DEBUG
    if ( $entry =~ /(\w+)$counter/i ) {
        $word = $1;
    } else {
        next;
    }

    if ( grep {/$word/} @sort_arr2 ) {
        say "\t found |$word| in both arrays (files) \n"; 
        $match_count++;
    } else {
        say "\t didn't match entry, |$entry| \n";
        $mismatch++;
    }
}

say "\n\t \$match_count: $match_count";
say "\t \$mismatch: $mismatch";

my $element_total = $match_count+$mismatch;
say "\n\t SLOPPY SPEC: among other issues, does not treat cases where 
+the number of instances of a word in one file 
\t is different than the number of instances in the second file as a m
+ismatch (eg. if the word is in both, even 
\t though in differing quanties, it's treated as a match.";
say "\t No allowance made for use with arrays having different numbers
+ of elements (variance produces 'uninitialized' warnings).\n";
say "\n\t Here's one measure of SIMILARITY (using matchs/total element
+s evaled): " . $match_count/$item_count;
say "\n\t Another uses the total of matches and mismatches as the divi
+sor: " . $match_count/$element_total;
say "\n\t Magnitude of DIS-similarty (using the ratio of mismatches/ma
+tches) : " . $mismatch/$match_count;
say "\n\t By the same sloppy spec, but using mismatch/elements_in_firs
+t_array): ". $mismatch/($#sort_arr1 + 1);

__DATA__
now is the time for all good men to come to the aid of their country w
+hile the quick red fox jumps over the lazy brown dog's back and the f
+ork runs away with the spoon
[download]

Output for review is in the readmore

D:\> 1123881a.pl

         --- Ln 25 Printing keys and values for the HASHES, %F1 and %F
+2
                 ...and creating ARRAYS @F1combined and @F2combined.
brown => 1
all => 1
their => 1
country => 1
is => 1
fox => 1
aid => 1
while => 1
to => 2
time => 1
for => 1
Now => 1
spooon => 1
dog's => 1
of => 1
away => 1
and => 1
with => 1
over => 1
run => 1
the => 5
quick => 1
fork => 1
red => 1
come => 1
men => 1
good => 1
lazy => 1
jump => 1

         --- %F2, next: ---
brown => 1
all => 1
runs => 1
jumps => 1
their => 1
country => 1
back => 1
is => 1
fox => 1
aid => 1
time => 1
while => 1
to => 2
for => 1
spoon => 1
dog's => 1
of => 1
away => 1
now => 1
with => 1
the => 6
and => 1
over => 1
fork => 1
quick => 1
red => 1
come => 1
men => 1
lazy => 1
good => 1
                 $item_count: 30
# some output, similar to the following, has been deleted for the sake
+ of brevity
# (even so, brevity is not present in abundance)
...
>> Ln 56    $i: 22   |spooon => 1|
         didn't match entry, |spooon => 1|

>> Ln 56    $i: 23   |the => 5|              # No allowance for varian
+ce in occurences, 5 in file1 and 6 in DATA
         found |the| in both arrays (files)  # Part of the loose spec;
+ should this count for similarity or dis-similarity?

>> Ln 56    $i: 24   |their => 1|
         found |their| in both arrays (files)

>> Ln 56    $i: 25   |time => 1|
         found |time| in both arrays (files)

>> Ln 56    $i: 26   |to => 2|
         found |to| in both arrays (files)

...

Use of uninitialized value $entry in scalar chomp at D:\_Perl_\pl_test
+\1123881a.pl line 55, <DATA> line 1.
Use of uninitialized value $entry in concatenation (.) or string at D:
+\_Perl_\pl_test\1123881a.pl line 56, <DATA
>> Ln 56    $i: 29   ||
Use of uninitialized value $entry in pattern match (m//) at D:\_Perl_\
+pl_test\1123881a.pl line 57, <DATA> line 1

         $match_count: 27
         $mismatch: 2

         SLOPPY SPEC: (AMONG OTHER ISSUES) does not treat cases where 
+the number of instances of a word in one file
         is different than the number of instances in the second file 
+as a mismatch (eg. if the word is in both, though in differing quanti
+es, it's treated as a match.
         No allowance made for use with arrays having different number
+s of elements (variance produces 'uninitialised" warnings


         Here's one measure of SIMILARITY (using matchs/total elements
+ evaled): 0.9

         Another uses the total of matches and mismatches as the divis
+or: 0.931034482758621

         Magnitude of DIS-similarty (using the ratio of mismatches/mat
+ches) : 0.0740740740740741

         By the same sloppy spec, but using mismatch/elements_in_first
+_array): 0.0689655172413793
[download]

Hope you find this interesting.

Spirit of the Monastery

++$anecdote ne $data

Comment on File Similarity Concept Select or Download Code

Replies are listed 'Best First'.
Re: File Similarity Concept by FreeBeerReekingMonk (Deacon) on May 14, 2015 at 15:21 UTC
I'm not drunk enough to fully understand the code, but the math called me. `Perl v5.18.0 required--this is only v5.14.2, stopped at q line 2.` :( I feel left out, because your code does not run on my system (an up-to-date Debian Stable release), it seems all you do is use "say", which means you should change `use 5.018;` into `use v5.10;` And it runs for me too.. :) Certainly interesting. Going beyond a mere wordcount. You also may want to throw in a `undef $/;` to slurp in more than the first line (or did you plan to enhance the algorithm to seek inserted lines, a bit what diff does)? You can also count the hash like this `$F1{$_}++ for @F1; $F1{$_}-- for @F2;` [download] if equal it is 0, if positive, the second file was missing it (or one ocurrance), if negative... well.. And depending on the farness from 0 you can build your stats. Takes less mem too.	[reply] [d/l] [select]
Re^2: File Similarity Concept by ww (Archbishop) on May 14, 2015 at 16:47 UTC
Thank You for the reply and the thought you put into it. I'm taking your suggestions re slurping and the hash variant under serious consideration (and not responding directly to those, right now, because first I have to be sure I didn't miss something; that I understand your intent; and that I know how and where to implement them). As to your point re specifying the v. in `use5.018`, I understand but choose to post with info for the reader on just what I used to run the script. While a downward revision might be 'kind' (as in "changing it so as to save another Monk the trouble of doing so") but might sometimes leave that individual without the info re what v. I used and would always incure extra work for me. Update: fixed in para 2: s/reading/reader/ Spirit of the Monastery	[reply] [d/l] [select]
Re^3: File Similarity Concept by FreeBeerReekingMonk (Deacon) on May 18, 2015 at 21:41 UTC
I can clarify that intent. Consider this: `my $file2 = <DATA>; chomp $file2; die $file2; __DATA__ The quick brown fox` [download] It yields: `The quick at data.pl line 5, <DATA> line 1.` [download] where as `undef $/; my $file2 = <DATA>; chomp $file2; die $file2; __DATA__ The quick brown fox` [download] yields: `The quick brown fox` [download] I understand your dillema with versions, and of course, you are free to do so. As for the counting: I suggested to count all words in one file as positives, and all words in the other file as negative. Thus, if the word "the" has the same occurrance in both files, then the value for that word in the hash will be zero. And either positive or negative if it occurs more than n time in one of them.	[reply] [d/l] [select]