As hdb told you, "(i)t really depends...."

Here are just a few more ways that "it...depends."

  1. Are two files with differing counts of a given word "exactly the same"?
  2. Does a pair of files satisfy your spec if one has a certain word capitalized and the other has that word in all lower case?
  3. How do you wish to categorize a pair of files in which two different forms of a particular word occur? For example, one file has a word hyphenated (at the end of a line) and the other file has the word in a position where it is not hyphenated.
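Whichever answers you pick, they can be applied once, up front, when the words are extracted. Here is a minimal sketch, assuming (my assumptions, not the OP's spec) that case is ignored and that a word hyphenated across a line break counts as the unhyphenated word. The name normalized_words is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw text into a list of normalized words.
# Assumptions: case does not matter, and "simi-" + newline + "larity"
# should be read as "similarity".
sub normalized_words {
    my ($text) = @_;

    # Remove a hyphen plus line break that sits between two word characters.
    $text =~ s/(?<=\w)-\r?\n\s*(?=\w)//g;

    # Extract runs of word characters and fold them to lower case.
    # Note: \w+ splits "don't" into "don" and "t"; adjust to taste.
    return map { lc $_ } $text =~ /\w+/g;
}

my @words = normalized_words("The simi-\nlarity of The files");
print "@words\n";    # prints: the similarity of the files
```

Changing your answer to any of the three questions then means changing this one routine, not the comparison code.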

Despite the wisdom offered by others (above), I suspect this is do-able.

I suspect one might tackle this problem by using sorted arrays of the words in the two files, and testing for matches by position. That will require the test to skip over every instance after the first of any word which appears multiple times in one or the other file, or which appears multiple times in both but a different number of times in each.
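A minimal sketch of the sorted-array idea, assuming each file has already been split into a list of words. The helper names sorted_unique and same_by_position are hypothetical; repeats are dropped before sorting, which is one way to "skip over" them.

```perl
use strict;
use warnings;

# Hypothetical helper: sorted list of the distinct words in a word list.
sub sorted_unique {
    my %seen;
    return sort grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: true if both sorted lists hold the same word
# at every position.
sub same_by_position {
    my ($x, $y) = @_;
    return 0 unless @$x == @$y;           # different number of distinct words
    for my $i (0 .. $#$x) {
        return 0 if $x->[$i] ne $y->[$i];
    }
    return 1;
}

my @file1 = sorted_unique(qw(the cat saw the other cat));
my @file2 = sorted_unique(qw(cat other saw the));
print same_by_position(\@file1, \@file2) ? "same words\n" : "different words\n";
# prints: same words
```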

Another possibility which might be worth exploring would be to use hashes to count instances of each word in each file (and perhaps cast those to a second set of sorted arrays where each array element has the word and count (key and value pairs) from the hash) and then...?

Well, one could use a regex to compare (by position) the word (key) elements in the second set of sorted arrays and decide -- accounting for case or count or both or neither -- if you'll accept a pair as "exactly the same" or not.
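A rough sketch of the hash approach. It compares the hashes directly with exists and != rather than regexing over a second set of sorted arrays, and it assumes any case folding was done before counting. The names word_counts and same_words and the check_counts flag are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs in a list.
sub word_counts {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Hypothetical helper: compare two count hashes.
# If $check_counts is false, only the sets of words (keys) must agree;
# if true, each word must also occur the same number of times.
sub same_words {
    my ($c1, $c2, $check_counts) = @_;
    return 0 unless keys %$c1 == keys %$c2;
    for my $word (keys %$c1) {
        return 0 unless exists $c2->{$word};
        return 0 if $check_counts && $c1->{$word} != $c2->{$word};
    }
    return 1;
}

my $first  = word_counts(qw(the cat saw the dog));
my $second = word_counts(qw(dog saw the cat));
print "keys only:   ", same_words($first, $second, 0), "\n";   # prints 1
print "with counts: ", same_words($first, $second, 1), "\n";   # prints 0
```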

The arithmetic for determining the similarity percentage is left as an exercise for the OP [   :-)   ] ... or, someone with better brains or more free time than I have at the moment.
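For what it's worth, one possible measure (my choice, not anything the OP specified) is the Jaccard index over the distinct words: shared words divided by all distinct words in either file, times 100. The name similarity_percent is hypothetical, and treating two empty files as 100% similar is an arbitrary assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity of two word lists, as a percentage.
# Word counts are ignored; only the presence of a word matters.
sub similarity_percent {
    my ($words1, $words2) = @_;
    my %in1 = map { $_ => 1 } @$words1;
    my %in2 = map { $_ => 1 } @$words2;

    my $shared = grep { $in2{$_} } keys %in1;          # words in both files
    my $total  = keys(%in1) + keys(%in2) - $shared;    # words in either file

    return $total ? 100 * $shared / $total : 100;      # two empty files: assume 100
}

printf "%.1f%%\n",
    similarity_percent([qw(the cat saw the dog)], [qw(the dog saw a bird)]);
# prints: 50.0%
```

Here the first list has 4 distinct words, the second has 5, and 3 are shared, so the result is 3 / (4 + 5 - 3) = 50%.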

UPDATE: 0740 EDT 20150419:

Found in C:\Perl\lib\pods\perlfaq4.pod
  How do I test whether two arrays or hashes are equal?
    With Perl 5.10 and later, the smart match operator can give you the
    answer with the least amount of work:

        use 5.010;

        if( @array1 ~~ @array2 ) {
            say "The arrays are the same";
        }

        if( %hash1 ~~ %hash2 ) {    # doesn't check values!  <- !!!
            say "The hash keys are the same";
        }

    ....

Sometimes, a fresh dawn and fresh coffee are helpful in finding the obvious.

Update2 1230 EDT 20150421: See File Similarity Concept for a proof of concept


In reply to Re: Similarity measurement by ww
in thread Similarity measurement by kennedy
