As hdb told you, "(i)t really depends...."
Here are just a few more ways that "it...depends."
Despite the wisdom offered by others (above), I suspect this is do-able.
I suspect one might tackle this problem by sorting the words of the two files into arrays and testing for matches by position. That will require the test to skip over every instance after the first of any word which appears multiple times in one file or the other, or which appears multiple times in both but a different number of times.
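As a rough sketch of the sorted-array idea (not a tested solution to the OP's problem), the following folds case, drops repeated words so each is considered only once, and then walks the two sorted lists in step. The helper names `sorted_unique_words` and `count_common` are made up for illustration, and treating any run of `\w` characters as a "word" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lowercased words, drop repeats,
# and return the survivors in sorted order.
sub sorted_unique_words {
    my ($text) = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } map { lc($_) } $text =~ /(\w+)/g;
    my @sorted = sort @unique;
    return @sorted;
}

# Walk both sorted lists in step and count the words they share.
sub count_common {
    my ($first, $second) = @_;
    my ($i, $j, $common) = (0, 0, 0);
    while ($i < @$first && $j < @$second) {
        my $cmp = $first->[$i] cmp $second->[$j];
        if    ($cmp < 0) { $i++ }                  # word only in first list
        elsif ($cmp > 0) { $j++ }                  # word only in second list
        else             { $common++; $i++; $j++ } # word in both
    }
    return $common;
}

my @words_a = sorted_unique_words("the cat sat on the mat");
my @words_b = sorted_unique_words("The dog sat on the log");
my $shared  = count_common(\@words_a, \@words_b);
print "shared words: $shared\n";   # prints "shared words: 3"
```

Removing duplicates before sorting sidesteps the "skip every instance after the first" bookkeeping, at the cost of throwing away the counts.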
Another possibility which might be worth exploring would be to use hashes to count instances of each word in each file, and perhaps cast those to a second set of sorted arrays in which each element holds a word and its count (the key and value pairs from the hash), and then...?
Well, one could use a regex to compare (by position) the word (key) elements in the second set of sorted arrays and decide -- accounting for case or count or both or neither -- whether you'll accept a pair as "exactly the same" or not.
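A minimal sketch of the hash-counting approach might look like the following. Note that it compares keys with plain hash lookups rather than a regex over sorted arrays, which is a simplification on my part; the helper names `word_counts` and `matching_words` and the two flags are hypothetical.

```perl
use strict;
use warnings;

# Count how often each word occurs; fold case unless $case_sensitive is set.
sub word_counts {
    my ($text, $case_sensitive) = @_;
    my %count;
    for my $word ($text =~ /(\w+)/g) {
        my $key = $case_sensitive ? $word : lc $word;
        $count{$key}++;
    }
    return \%count;
}

# Return (sorted) the words found in both hashes. With $match_counts set,
# a word qualifies only if it occurs the same number of times in each.
sub matching_words {
    my ($left, $right, $match_counts) = @_;
    my @matches = grep {
        exists $right->{$_}
            && (!$match_counts || $left->{$_} == $right->{$_})
    } sort keys %$left;
    return \@matches;
}

my $counts_a = word_counts("the cat and the hat");
my $counts_b = word_counts("The cat saw a hat");

my $loose  = matching_words($counts_a, $counts_b, 0);  # ignore counts
my $strict = matching_words($counts_a, $counts_b, 1);  # counts must agree

print "ignoring counts: @$loose\n";    # prints "ignoring counts: cat hat the"
print "matching counts: @$strict\n";   # prints "matching counts: cat hat"
```

Here "the" drops out of the strict list because it appears twice in the first text and once in the second, which is exactly the "case or count or both or neither" decision left to the reader.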
The arithmetic for determining the similarity percentage is left as an exercise for the OP [ :-) ] ... or for someone with better brains or more free time than I have at the moment.
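For what it's worth, one possible way to do that arithmetic (an assumption of mine, not something the OP specified) is to divide the number of word occurrences the two files share by the number of occurrences in their union, using the per-word counts. The function name `similarity_percent` is hypothetical, and calling two empty inputs 100% similar is an arbitrary choice.

```perl
use strict;
use warnings;
use 5.010;                       # for the // (defined-or) operator
use List::Util qw(min max);

# Takes two hash refs of word => count and returns a percentage:
# sum of the smaller counts divided by sum of the larger counts.
sub similarity_percent {
    my ($left, $right) = @_;
    my %all = (%$left, %$right);     # union of the words from both inputs
    my ($shared, $total) = (0, 0);
    for my $word (keys %all) {
        my $l = $left->{$word}  // 0;
        my $r = $right->{$word} // 0;
        $shared += min($l, $r);
        $total  += max($l, $r);
    }
    # Assumption: two empty inputs count as identical.
    return $total ? 100 * $shared / $total : 100;
}

my %file_a = (the => 2, cat => 1, wore => 1, hat => 1);
my %file_b = (the => 1, cat => 1, saw => 1, one => 1, hat => 1);

my $percent = similarity_percent(\%file_a, \%file_b);
printf "similarity: %.1f%%\n", $percent;   # prints "similarity: 42.9%"
```

Other measures would give different numbers for the same files, so which one is "right" still depends on what the OP means by similarity.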
UPDATE: 0740 EDT 20150419:
Found in C:\Perl\lib\pods\perlfaq4.pod
How do I test whether two arrays or hashes are equal?
With Perl 5.10 and later, the smart match operator can give you the
answer with the least amount of work:
    use 5.010;

    if( @array1 ~~ @array2 ) {
        say "The arrays are the same";
    }

    if( %hash1 ~~ %hash2 ) { # doesn't check values!  <- !!!
        say "The hash keys are the same";
    }
....
Sometimes, a fresh dawn and fresh coffee are helpful in finding the obvious.
Update2 1230 EDT 20150421: See File Similarity Concept for a proof of concept
In reply to Re: Similarity measurement
by ww
in thread Similarity measurement
by kennedy