in reply to File Similarity Concept

I'm not drunk enough to fully understand the code, but the math called me.
Perl v5.18.0 required--this is only v5.14.2, stopped at q line 2.

:(

I feel left out, because your code does not run on my system (an up-to-date Debian Stable release). It seems the only newer feature you use is "say", which means you should change
use 5.018;
into
use v5.10;
Then it runs for me too.

:)

Certainly interesting. Going beyond a mere wordcount.
You may also want to throw in an undef $/; to slurp in more than the first line (or did you plan to enhance the algorithm to look for inserted lines, a bit like what diff does)?
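To illustrate the slurping suggestion, here is a minimal sketch. It assumes the goal is simply to read a whole file into one string before splitting it into words. The helper name slurp_fh and the in-memory sample text are made up for this example and are not from the original script. It uses local $/ instead of a bare undef $/ so the change is undone when the sub returns.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): read everything
# that is left on a filehandle into a single string.
sub slurp_fh {
    my ($fh) = @_;
    local $/;                 # input record separator is undef only inside this sub
    my $text = <$fh>;         # one read now returns the whole remaining content
    return defined $text ? $text : '';
}

# An in-memory filehandle stands in for a real file here.
my $sample = "The quick\nbrown fox\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my $text  = slurp_fh($fh);
my @words = split ' ', $text;     # split on any whitespace, newlines included

print scalar(@words), " words read\n";
```

With a plain my $line = <$fh>; only "The quick" would have been read; with the separator undefined, all four words end up in @words.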
You can also do the counting in a single hash, like this:

$F1{$_}++ for @F1; $F1{$_}-- for @F2;

If the counts are equal, the value is 0; if positive, the second file was missing the word (or one occurrence of it); if negative... well. Depending on the distance from 0 you can build your stats. It takes less memory too.
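A rough sketch of how that one-hash count could feed a statistic. The sample word lists, the helper name word_delta, and the "sum of absolute differences" measure are assumptions made for illustration, not what the original script does.

```perl
use strict;
use warnings;

# Hypothetical helper: count words from the first list as +1 and
# words from the second list as -1, all in one hash.
sub word_delta {
    my ($first, $second) = @_;
    my %delta;
    $delta{$_}++ for @$first;
    $delta{$_}-- for @$second;
    return \%delta;
}

# Made-up sample data standing in for the words of two files.
my @F1 = qw(the quick brown fox the end);
my @F2 = qw(the quick red fox);

my $delta = word_delta(\@F1, \@F2);

# One possible statistic: the sum of the absolute differences.
# 0 means both lists contain the same words the same number of times.
my $distance = 0;
$distance += abs $_ for values %$delta;

for my $word (sort keys %$delta) {
    printf "%-6s %+d\n", $word, $delta->{$word};
}
print "distance: $distance\n";
```

Here "quick" and "fox" come out at 0, "the", "brown" and "end" at +1 (more often in the first list), and "red" at -1 (only in the second list), for a distance of 4.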

Re^2: File Similarity Concept
by ww (Archbishop) on May 14, 2015 at 16:47 UTC

    Thank You for the reply and the thought you put into it.

    I'm taking your suggestions re slurping and the hash variant under serious consideration (and not responding directly to those, right now, because first I have to be sure I didn't miss something; that I understand your intent; and that I know how and where to implement them).

    As to your point re specifying the v. in use 5.018, I understand but choose to post with info for the reader on just what I used to run the script. A downward revision might be 'kind' (as in "changing it so as to save another Monk the trouble of doing so"), but it might sometimes leave that individual without the info re what v. I used, and it would always incur extra work for me.

    Update: fixed in para 2: s/reading/reader/

      I can clarify that intent. Consider this:

      my $file2 = <DATA>;
      chomp $file2;
      die $file2;
      __DATA__
      The quick
      brown fox

      It yields:

      The quick at data.pl line 5, <DATA> line 1.

      whereas

      undef $/;
      my $file2 = <DATA>;
      chomp $file2;
      die $file2;
      __DATA__
      The quick
      brown fox

      yields:

      The quick brown fox

      I understand your dilemma with versions, and of course you are free to post it that way.

      As for the counting: I suggested counting all words in one file as positives and all words in the other file as negatives. Thus, if the word "the" occurs the same number of times in both files, the value for that word in the hash will be zero. It will be positive or negative if the word occurs more often in one of them.