Did you notice that your algorithm's running time grows quadratically, because you reread one of the files for every single line of the other file? As long as the two files are small that's OK, but as soon as the second file grows larger than the disk cache in memory you will get running times of hours.

Also, what happens if one file has a single line "x" and the other file has 100 lines "x"? Your algorithm will count that as a match, even though 99 of the "x" lines in one file have no corresponding line in the other.
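To make the duplicate problem concrete, here is a minimal sketch that counts occurrences instead of just testing for presence. It assumes a count-based comparison is what you want, which may not be the case; the helper names count_lines and unmatched_copies are made up for this illustration, and the two arrays stand in for the contents of the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each line occurs in a list of lines.
sub count_lines {
    my @lines = @_;
    my %count;
    $count{$_}++ for @lines;
    return \%count;
}

# Hypothetical helper: for each distinct line, report how many copies
# in the second list have no partner in the first list.
sub unmatched_copies {
    my ($first, $second) = @_;
    my $count1 = count_lines(@$first);
    my $count2 = count_lines(@$second);
    my %unmatched;
    for my $line (keys %$count2) {
        my $surplus = $count2->{$line} - ($count1->{$line} // 0);
        $unmatched{$line} = $surplus if $surplus > 0;
    }
    return \%unmatched;
}

# Sample data: one "x" in the first file, 100 in the second.
my @file1 = ("x\n");
my @file2 = ("x\n") x 100;

my $unmatched = unmatched_copies(\@file1, \@file2);
printf "%d copies of 'x' have no partner\n", $unmatched->{"x\n"};
# prints "99 copies of 'x' have no partner"
```

Each file is read only once here, so the running time grows with the sum of the file sizes rather than their product.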

You want to distinguish empty lines from lines that differ at the same line number. At the same time you compare *all* lines in one file with every line in the other file. How do you want to count this? Is a mismatch on the same line but a match on a different line worth 1 point (as you have now), and a match on the same line worth 2 points? What, then, is a line worth when the line at the same position is empty? And what does the summary at the end tell you, other than a rather meaningless number?

OK, first suggestion: use the diff utility (installed on practically every Unix variant, and it should be available for Windows too) or a diff module from CPAN, as someone else suggested. If that is not an option, think carefully about what you want. If you really want to compare every line of one file with every line of the other, and the file sizes are smaller than gigabytes, use a hash to store one file, i.e.

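my %file1;
my $linenumber = 0;
# Read line by line instead of slurping the whole file into a list first.
while (my $line = <FILE1>) {
    # Key: the line's text. Value: its 1-based line number.
    $file1{$line} = ++$linenumber;
}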

Then you can use the hash to look up any line of the other file, and it even tells you the line number where that line was found. But if the same line occurs multiple times in that file, the hash will only tell you the last line number at which it was found. Some more effort (more complicated data structures) would be necessary to distinguish between the occurrences.
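One possible "more complicated data structure" is a hash of arrays that keeps every line number instead of only the last one. The following is just a sketch under a few assumptions: the helper name index_lines is invented for this example, line numbers are counted from 1, trailing newlines are chomped before comparing, and two in-memory strings stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: map each line's text to the list of 1-based
# line numbers at which it occurs in the given filehandle.
sub index_lines {
    my ($fh) = @_;
    my %seen;
    my $linenumber = 0;
    while (my $line = <$fh>) {
        chomp $line;
        push @{ $seen{$line} }, ++$linenumber;
    }
    return \%seen;
}

# In-memory sample data stands in for the two real files.
my $text1 = "a\nx\nb\nx\n";
my $text2 = "x\nc\n";

open my $fh1, '<', \$text1 or die "cannot open in-memory file: $!";
my $index = index_lines($fh1);
close $fh1;

open my $fh2, '<', \$text2 or die "cannot open in-memory file: $!";
my $n = 0;
while (my $line = <$fh2>) {
    chomp $line;
    $n++;
    if (my $where = $index->{$line}) {
        print "line $n ('$line') occurs in file 1 at: @$where\n";
    }
    else {
        print "line $n ('$line') not found in file 1\n";
    }
}
close $fh2;
```

With the sample data, "x" is reported at lines 2 and 4 of the first file and "c" is reported as not found. How you score those multiple hits is still up to you.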


In reply to Re: on matching content of two text files by jethro
in thread on matching content of two text files by sarvan
