You do know that the running time of your algorithm grows quadratically, i.e. with the product of the two file sizes? And that you reread one of the files for every single line of the other file? As long as the two files are small that's OK, but as soon as the second file grows larger than your disk cache in memory you will get running times of hours.
Also what happens if one file has a line "x" and the other file has 100 lines "x"? Your algorithm will note that as a match even though 99 of the "x" lines in one file have no corresponding line in the other.
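To make the duplicate problem concrete, here is a minimal sketch that counts how often each line occurs in each input and reports only the surplus. The sub name unmatched_counts and the in-memory arrays are made up for illustration; real code would fill the lists from the two file handles.

```perl
use strict;
use warnings;

# Hypothetical helper: for each distinct line, report how many copies
# in one list have no partner in the other list.
sub unmatched_counts {
    my ($lines1, $lines2) = @_;
    my %count;
    $count{$_}++ for @$lines1;    # +1 for every occurrence in file 1
    $count{$_}-- for @$lines2;    # -1 for every occurrence in file 2

    # Keep only lines whose counts differ: a positive value is a surplus
    # in file 1, a negative value is a surplus in file 2.
    my %surplus;
    for my $line (keys %count) {
        $surplus{$line} = $count{$line} if $count{$line} != 0;
    }
    return \%surplus;
}

# One "x" in the first list, 100 in the second, one "y" in each.
my $surplus = unmatched_counts([ 'x', 'y' ], [ ('x') x 100, 'y' ]);
print "$_: $surplus->{$_}\n" for sort keys %$surplus;
```

With this input the only entry reported is "x" with a value of -99, i.e. the 99 lines that have no partner, while "y" drops out because it pairs up exactly.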
You want to differentiate empty lines from different lines at the same line number. At the same time you compare *all* lines in one file with any line in the other file. How do you want to count this? Is a mismatch on the same line number but a match on a different line worth 1 point (as you have now), and a match on the same line number worth 2 points? And what then is a line worth where the line at the same position in the other file is empty? And what does the summary at the end then tell you, except a rather meaningless number?
OK, first suggestion: use the diff utility (always installed on any Unix dialect, and it should be available for Windows too) or a Diff CPAN module, as someone else suggested. If that doesn't fit, think carefully about what you want. If you really want to compare any line of one file with any line of the other, and the files are well below the gigabyte range, use a hash to store one file, for example:
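my %file1;
my $linenumber = 0;
# Read line by line; foreach (<FILE1>) would first load the whole file into a list.
while (<FILE1>) {
    $file1{$_} = $linenumber++;    # 0-based line number; a repeated line overwrites the earlier entry
}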
Then you can use the hash to look up any line of the other file, and it even tells you the line number where that line was found. But if the same line occurs multiple times in that file, the hash will only tell you the last line number where it was found. Some more effort (more complicated data structures) would be necessary to differentiate between them.
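One possible "more complicated data structure" is a hash of arrays that keeps every line number instead of only the last one. The following is just a sketch under a few assumptions: the sub name index_lines is invented, the line numbers are 1-based (taken from $.), lines are chomped before comparison, and in-memory file handles stand in for the two real files.

```perl
use strict;
use warnings;

# Hypothetical helper: map each distinct line to the list of 1-based
# line numbers where it occurs, so duplicates are not overwritten.
sub index_lines {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        push @{ $seen{$line} }, $.;    # $. is the current line number of $fh
    }
    return \%seen;
}

# In-memory file handles stand in for the two real files.
my $text1 = "a\nx\nb\nx\n";
my $text2 = "x\nc\n";
open my $fh1, '<', \$text1 or die "cannot open first input: $!";
open my $fh2, '<', \$text2 or die "cannot open second input: $!";

my $index = index_lines($fh1);

# A single pass over the second file; each lookup is one hash access.
while (my $line = <$fh2>) {
    my $n = $.;
    chomp $line;
    my $where = $index->{$line};
    print "file2 line $n: ",
        ($where ? "found at file1 lines @$where" : "not found"), "\n";
}
```

Each file is read exactly once, so the running time grows with the sum of the file sizes rather than with their product, at the cost of holding one file's lines in memory.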
In reply to Re: on matching content of two text files by jethro, in thread on matching content of two text files by sarvan