kennedy has asked for the wisdom of the Perl Monks concerning the following question:
Re: Similarity measurement
by hdb (Monsignor) on Apr 18, 2015 at 14:02 UTC
It really depends on the objective. You could count the occurrences of each word and compare the frequencies, but the frequencies could be identical even though the text files are completely different. A single 'not' could change the meaning of a text into its opposite. The 'diff' tool is useful for comparing text files, but it works mainly on lines. You could treat each text file as a sequence of words and then apply a longest common subsequence algorithm, as in Algorithm::Diff.
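Not the OP's code; just a minimal sketch of that word-sequence/LCS idea, assuming Algorithm::Diff is installed and that splitting on whitespace is an acceptable notion of "word":

#!/usr/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(LCS);

# Read a file as a flat list of whitespace-separated words.
sub words_of {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    return split ' ', <$fh>;
}

my @a = words_of( $ARGV[0] );
my @b = words_of( $ARGV[1] );

# Longest common subsequence of the two word lists.
my @lcs = LCS( \@a, \@b );

# One possible score: shared subsequence length relative to the longer file.
my $longer = @a > @b ? @a : @b;
my $score  = $longer ? @lcs / $longer : 1;
printf "similarity: %.3f\n", $score;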
Re: Similarity measurement
by ww (Archbishop) on Apr 19, 2015 at 03:34 UTC
As hdb told you, "(i)t really depends...." Here are just a few more ways that "it...depends." Despite the wisdom offered by others (above), I suspect this is do-able.

One might tackle this problem by using sorted arrays of the words in each of the two files and testing for matches by position (which will require the test to skip over every instance after the first of any word that appears multiple times in one file or the other, or that appears multiple times in both but a different number of times).

Another possibility worth exploring would be to use hashes to count the instances of each word in each file (and perhaps cast those into a second set of sorted arrays where each element holds a word and its count, i.e. the key/value pairs from the hash). One could then use a regex to compare, by position, the word (key) elements in that second set of sorted arrays and decide -- accounting for case, count, both, or neither -- whether to accept a pair as "exactly the same." The arithmetic for turning that into a similarity percentage is left as an exercise for the OP [ :-) ] ... or for someone with better brains or more free time than I have at the moment. A sketch of this hash-counting idea follows the perlfaq excerpt below.

UPDATE: 0740 EDT 20150419: Found in C:\Perl\lib\pods\perlfaq4.pod
How do I test whether two arrays or hashes are equal?
With Perl 5.10 and later, the smart match operator can give you the
answer with the least amount of work:
use 5.010;
if( @array1 ~~ @array2 ) {
    say "The arrays are the same";
}

if( %hash1 ~~ %hash2 ) { # doesn't check values!  <- !!!
    say "The hash keys are the same";
}
....
Sometimes, a fresh dawn and fresh coffee are helpful in finding the obvious. Update2 1230 EDT 20150421: See File Similarity Concept for a proof of concept.
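Not the proof of concept linked above, and only a guess at the details; a minimal sketch of the hash-counting idea described earlier, assuming case-folding and whitespace word-splitting are acceptable:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Count word occurrences in one file, case-folded.
sub word_counts {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %count;
    while (<$fh>) {
        $count{ lc $_ }++ for split ' ';
    }
    return \%count;
}

my $c1 = word_counts( $ARGV[0] );
my $c2 = word_counts( $ARGV[1] );

# Words whose counts agree, over the union of words seen in either file.
my %union = map { $_ => 1 } keys %$c1, keys %$c2;
my $same  = grep { ($c1->{$_} // 0) == ($c2->{$_} // 0) } keys %union;
my $total = keys %union;
my $pct   = $total ? 100 * $same / $total : 100;
printf "%.1f%% of distinct words appear exactly the same number of times\n", $pct;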
Re: Similarity measurement
by Khen1950fx (Canon) on Apr 18, 2015 at 22:33 UTC
It'll take a few minutes, but it comes back with a score. In this case, the result was 0.999615754082613 for two files that were exactly the same; for two completely different files, the similarity score came back at 0.345969033635878.
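The script that produced those numbers isn't reproduced above. One common way to get a score in that 0-to-1 range is cosine similarity over word-frequency vectors; the sketch below is only an assumption about the approach, not the poster's code, and will not reproduce the quoted figures:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Build a word-frequency vector (hash) for one file.
sub vector_of {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %v;
    while (<$fh>) {
        $v{ lc $_ }++ for split ' ';
    }
    return \%v;
}

# Cosine similarity: dot product over the product of the vector lengths.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
    $nx  += $_**2 for values %$x;
    $ny  += $_**2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

my $v1 = vector_of( $ARGV[0] );
my $v2 = vector_of( $ARGV[1] );
printf "similarity: %.3f\n", cosine( $v1, $v2 );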
Re: Similarity measurement
by Marshall (Canon) on Apr 18, 2015 at 23:48 UTC
One technique is Text::Levenshtein, which calculates the Levenshtein edit distance between two strings. This is not easy even for a human to do. There are so many syntactic differences between languages that it is almost impossible for a computer to do.
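A minimal sketch of that module in use, assuming the files are small enough that an edit-distance comparison over the whole contents is tolerable (it gets expensive quickly on long texts):

#!/usr/bin/perl
use strict;
use warnings;
use Text::Levenshtein qw(distance);

# Slurp both files whole.
my @texts = map {
    open my $fh, '<', $_ or die "Cannot open $_: $!";
    local $/;
    scalar <$fh>;
} @ARGV[0, 1];

# Edit distance, then a crude similarity: 1 - distance / length of the longer text.
my $dist   = distance( $texts[0], $texts[1] );
my $longer = length( $texts[0] ) > length( $texts[1] )
           ? length( $texts[0] )
           : length( $texts[1] );
printf "edit distance: %d, similarity: %.3f\n",
    $dist, $longer ? 1 - $dist / $longer : 1;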
Re: Similarity measurement
by FreeBeerReekingMonk (Deacon) on Apr 19, 2015 at 06:16 UTC
comm comes to mind, but this is a perl forum, so here is my shot at it, although it is flawed: permuted lines do not get registered as a difference, nor do extra repeated lines:
Not sure who to give credit to... here is the source: http://www.cyberciti.biz/faq/command-to-display-lines-common-in-files/
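The posted code isn't reproduced above; as a rough sketch of the same idea (hash the lines of one file, then count how many lines of the other also appear), with the caveats about ordering and repeats already noted:

#!/usr/bin/perl
use strict;
use warnings;

my ($file1, $file2) = @ARGV;

# Remember every line seen in the first file.
my %seen;
open my $fh1, '<', $file1 or die "Cannot open $file1: $!";
$seen{$_} = 1 while <$fh1>;

# Count how many lines of the second file also appear in the first.
my ($common, $total) = (0, 0);
open my $fh2, '<', $file2 or die "Cannot open $file2: $!";
while (<$fh2>) {
    $total++;
    $common++ if $seen{$_};
}

printf "%d of %d lines in %s also occur in %s (%.1f%%)\n",
    $common, $total, $file2, $file1,
    $total ? 100 * $common / $total : 0;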
by Happy-the-monk (Canon) on Apr 19, 2015 at 06:42 UTC
Credit goes to mu, the source says. Cheers, Sören. Creator of mobile bugs - let loose once, run everywhere.
Re: Similarity measurement
by oiskuu (Hermit) on Apr 20, 2015 at 05:39 UTC
The question is vague as to the intent and scope of the problem. The best I could surmise is that this might relate to plagiarism detection. How literally do you mean "based on word content"? One rudimentary approach is to compress the text units first separately, then together (as a "solid archive"), for an estimate of entropy.
Just something to toy with. There might be a better way to get at the compressed size.
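The script this refers to isn't shown above; a minimal sketch of the compress-separately-versus-together idea, assuming Compress::Zlib is available (a crude normalized compression distance):

#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Slurp a file.
sub slurp {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    return scalar <$fh>;
}

my $t1 = slurp( $ARGV[0] );
my $t2 = slurp( $ARGV[1] );

# Compressed sizes: each text alone, and both concatenated.
my $c1  = length compress($t1);
my $c2  = length compress($t2);
my $c12 = length compress($t1 . $t2);

# Normalized compression distance: near 0 for near-identical texts, near 1 for unrelated ones.
my ($min, $max) = $c1 < $c2 ? ($c1, $c2) : ($c2, $c1);
printf "NCD: %.3f\n", ($c12 - $min) / $max;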