jajaja has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I have two similar files and I'd like to compare them word by word to find out how many words differ. Both files are the same size. What would be the best way to do that? Thanks. I was thinking about something like this:
$good = 0;
$bad = 0;
while ((defined ($line1 = <INPUT1>)) && (defined ($line2 = <INPUT2>))) {
    @words1 = split(/ /, $line1);
    @words2 = split(/ /, $line2);
    $wordsno = @words2;
    until ($wordsno-- == 0) {
        if ($words1[$wordsno] eq $words2[$wordsno]) { $good++; }
        else { $bad++; }
    }
}
print "good: $good";
print "bad: $bad";
close INPUT1;
close INPUT2;
and it works :)

Re: compare files by words
by Zaxo (Archbishop) on May 31, 2007 at 06:48 UTC

    Different in sequence, or not appearing in the other file? Any sort of uniqueness problem looks like it needs a hash, but is that really the problem you have?
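If the question is which words appear in one file but not in the other, a hash-based lookup could look roughly like the sketch below. The helper name `words_missing_from` and the sample sentences are made up for illustration, and the words are assumed to have been read into arrays already.

```perl
use strict;
use warnings;
use utf8;                                # the sample text contains accented letters
binmode STDOUT, ':encoding(UTF-8)';

# Return the distinct words of @$from that never occur in @$other.
sub words_missing_from {
    my ($from, $other) = @_;
    my %seen = map { $_ => 1 } @$other;  # set of words in the other list
    my %reported;                        # avoids listing a word twice
    return grep { !$seen{$_} && !$reported{$_}++ } @$from;
}

# Hypothetical sample data standing in for the two files.
my @file1 = split ' ', "the cafe is near the resume office";
my @file2 = split ' ', "the café is near the résumé office";

my @only_in_1 = words_missing_from(\@file1, \@file2);
my @only_in_2 = words_missing_from(\@file2, \@file1);

print "only in first:  @only_in_1\n";
print "only in second: @only_in_2\n";
```

This ignores word order entirely, so it answers a different question than a position-by-position comparison.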

    Algorithm::Diff may be helpful, but you haven't really said what you need.

    After Compline,
    Zaxo

      Zaxo is correct. I also use Algorithm::Diff to a great extent. It is simple to use (once you understand the nested array structure) and acts like Unix diff.
      To increase the speed, I suggest the following method.
      1) First compare the lines as plain strings; if they are the same, just move on to the next pair of lines.
      2) If the lines are NOT the same, then use Algorithm::Diff to find the differences.
      Regards,
      SanPerl
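The two steps above might be sketched as follows. This assumes Algorithm::Diff is installed from CPAN and uses its `sdiff` function; the helper name `count_word_diffs` and the sample lines are invented for the example.

```perl
use strict;
use warnings;
use Algorithm::Diff qw(sdiff);   # CPAN module, assumed to be installed

# Count word-level differences between two lines.
sub count_word_diffs {
    my ($line1, $line2) = @_;

    # Step 1: identical lines need no further work.
    return 0 if $line1 eq $line2;

    # Step 2: only lines that differ are split and diffed.
    my @words1 = split ' ', $line1;
    my @words2 = split ' ', $line2;

    # sdiff returns [flag, left, right] entries; flag 'u' means unchanged,
    # while 'c', '+' and '-' mark changed, added and removed words.
    return scalar grep { $_->[0] ne 'u' } sdiff(\@words1, \@words2);
}

# Hypothetical line pairs standing in for the two input files.
my @pairs = (
    [ "one two three\n", "one two three\n" ],
    [ "a naive cafe\n",  "a naïve café\n"  ],
);

my $total = 0;
$total += count_word_diffs(@$_) for @pairs;
print "different words: $total\n";
```

Whether the string comparison actually saves time depends on how many lines are identical; for nearly identical files it should skip most of the diff work.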
      They are almost the same two files. They differ in diacritics only, and I just need to know how many words have different diacritics. I don't need to know the details.
        They are almost the same two files. They differ in diacritics only, and I just need to know how many words have different diacritics. I don't need to know the details.

        In this case your approach above seems fine. Did you try it? Did it fail somehow? One thing you "have" to do is make it strict-safe. Then, for the word comparison I'd write:

        no warnings 'uninitialized';
        ($words1[$_] eq $words2[$_] ? $good : $bad)++ for 0..(@words1>@words2 ? $#words1 : $#words2);

        (I suppose you want to count a word as bad if it has no corresponding word at all. Otherwise you should change > into <. In the latter case the no warnings line wouldn't be necessary.)
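For illustration, here is one way the two variants could be written out under strict. The function name `count_words` and the `$count_missing` flag are hypothetical; the flag selects between comparing up to the longer list (the `>` variant) and only up to the shorter one (the `<` variant).

```perl
use strict;
use warnings;

# Compare two word lists position by position.
# $count_missing true:  a word with no partner counts as bad ('>' variant).
# $count_missing false: only positions present in both lists are compared ('<').
sub count_words {
    my ($w1, $w2, $count_missing) = @_;
    my ($good, $bad) = (0, 0);

    # Last index of the shorter and of the longer list.
    my ($short, $long) = sort { $a <=> $b } $#$w1, $#$w2;
    my $last = $count_missing ? $long : $short;

    for my $i (0 .. $last) {
        # A missing word is treated as a mismatch instead of comparing undef,
        # so no warnings need to be switched off.
        if (defined $w1->[$i] && defined $w2->[$i] && $w1->[$i] eq $w2->[$i]) {
            $good++;
        }
        else {
            $bad++;
        }
    }
    return ($good, $bad);
}

my @words1 = qw(a b c d);
my @words2 = qw(a x c);

my ($good_all, $bad_all)       = count_words(\@words1, \@words2, 1);
my ($good_common, $bad_common) = count_words(\@words1, \@words2, 0);

print "counting missing words: good $good_all, bad $bad_all\n";
print "common positions only:  good $good_common, bad $bad_common\n";
```

The explicit `defined` checks are a design choice of this sketch; the one-liner achieves the same effect by silencing the 'uninitialized' warning.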

        Update: you also probably don't want to split on / / but on ' ', which is more likely to do what you mean and in fact is also the default.
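A small demonstration of the difference, using a made-up input line with a double space and a trailing newline:

```perl
use strict;
use warnings;

my $line = "one  two three\n";      # note the double space and the newline

# A single-space regex yields an empty field for the doubled space
# and leaves the newline attached to the last word.
my @by_regex = split / /, $line;    # ("one", "", "two", "three\n")

# The special ' ' pattern skips leading whitespace and splits on runs
# of any whitespace, so the newline is dropped as well.
my @by_string = split ' ', $line;   # ("one", "two", "three")

printf "split / / : %d fields\n", scalar @by_regex;
printf "split ' ' : %d fields\n", scalar @by_string;
```

With / /, an extra space in only one of the two files would shift every following word by one position, which would inflate the count of differing words.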