compare files by words

jajaja has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I have two similar files and I'd like to compare them word by word to find out how many words differ. Both files are the same size. What would be the best way to do that? Thanks. I was thinking about something like this:


$good = 0;
$bad  = 0;
# INPUT1 and INPUT2 are filehandles assumed to be open on the two files already
while ((defined ($line1 = <INPUT1>)) && (defined ($line2 = <INPUT2>)))
{
    @words1 = split(/ /, $line1);
    @words2 = split(/ /, $line2);
    $wordsno = @words2;
    # walk the word positions of the second line from last to first
    until ($wordsno-- == 0) {
        if ($words1[$wordsno] eq $words2[$wordsno]) { $good++; }
        else { $bad++; }
    }
}

print "good: $good\n";
print "bad: $bad\n";
close INPUT1;
close INPUT2;

and it works :)


Re: compare files by words by Zaxo (Archbishop) on May 31, 2007 at 06:48 UTC
Different in sequence, or not appearing in the other file? Any sort of uniqueness problem looks like it needs a hash, but is that really the problem you have? Algorithm::Diff may be helpful, but you haven't really said what you need. After Compline, Zaxo
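If the question is the "not appearing in the other file" one, the hash idea could look roughly like the sketch below. The function name `words_only_in_first` is made up for illustration, and the sketch assumes words are whitespace-separated and that both texts fit in memory.

```perl
use strict;
use warnings;

# Return the distinct words that occur in the first text but nowhere
# in the second. Both arguments are plain strings.
sub words_only_in_first {
    my ($text1, $text2) = @_;

    # Every word of the second text becomes a hash key for fast lookup.
    my %seen = map { $_ => 1 } split ' ', $text2;

    # Keep words missing from %seen, reporting each one only once.
    my %reported;
    return grep { !$seen{$_} && !$reported{$_}++ } split ' ', $text1;
}

my @missing = words_only_in_first("one two three two", "one three");
print "only in first: @missing\n";
```

This ignores word order entirely, so it answers a different question than a position-by-position comparison does.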
Re^2: compare files by words by sanPerl (Friar) on May 31, 2007 at 10:38 UTC
Zaxo is correct. I also use Algorithm::Diff to a great extent. It is simple to use (once you understand the nested array structure) and acts like Unix diff. To increase speed, I suggest the following method: 1) First compare the lines as plain strings; if they are the same, just move ahead to the next pair of lines. 2) If the lines are not the same, use Algorithm::Diff to find the differences. Regards, SanPerl
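The two steps above might be sketched as follows. This assumes Algorithm::Diff is installed from CPAN (it is not a core module) and uses its `sdiff` function. The name `count_changed_words` and the choice to take array references of lines are illustrative only.

```perl
use strict;
use warnings;
use Algorithm::Diff qw(sdiff);   # CPAN module, assumed to be installed

# Count the words that differ between two lists of lines.
# Step 1: identical lines are skipped with a cheap string comparison.
# Step 2: only differing lines are handed to Algorithm::Diff.
sub count_changed_words {
    my ($lines1, $lines2) = @_;
    my $changed = 0;
    my $last = @$lines1 > @$lines2 ? $#$lines1 : $#$lines2;

    for my $i (0 .. $last) {
        # A line missing from the shorter list is treated as empty.
        my $l1 = defined $lines1->[$i] ? $lines1->[$i] : '';
        my $l2 = defined $lines2->[$i] ? $lines2->[$i] : '';
        next if $l1 eq $l2;

        my @w1 = split ' ', $l1;
        my @w2 = split ' ', $l2;

        # sdiff returns [flag, a, b] triples; the flag 'u' marks an
        # unchanged pair, so everything else counts as a difference.
        $changed += grep { $_->[0] ne 'u' } sdiff(\@w1, \@w2);
    }
    return $changed;
}

my $n = count_changed_words(["a b c", "same line"], ["a x c", "same line"]);
print "changed words: $n\n";
```

Unlike a plain index-by-index comparison, this stays in step when a word is inserted or dropped in the middle of a line.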
Re^2: compare files by words by jajaja (Initiate) on May 31, 2007 at 07:19 UTC
The two files are almost the same. They differ in diacritics only, and I just need to know how many words have different diacritics. I don't need to know the details.
Re^3: compare files by words by blazar (Canon) on May 31, 2007 at 08:25 UTC
"The two files are almost the same. They differ in diacritics only, and I just need to know how many words have different diacritics. I don't need to know the details."

In this case your approach above seems fine. Did you try it? Did it fail somehow? One thing you "have" to do is make it strict-safe. Then, for the word comparison I'd write: `no warnings 'uninitialized'; ($words1[$_] eq $words2[$_] ? $good : $bad)++ for 0..(@words1>@words2 ? $#words1 : $#words2);` (I suppose you want to count a word as bad if it has no counterpart at all. Otherwise you should change `>` into `<`; in that case the `no warnings` line wouldn't be necessary.) Update: you also probably don't want to split on `/ /` but on `' '`, which is more likely to do what you mean and is in fact also the default.
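Putting those suggestions together, a strict-safe version might look roughly like this. It is only a sketch: the function name `compare_words` is invented here, and in-memory filehandles stand in for the two real files, whose names the thread never gives.

```perl
use strict;
use warnings;

# Compare two filehandles line by line, word by word.
# Returns (good, bad); a word with no counterpart counts as bad.
sub compare_words {
    my ($fh1, $fh2) = @_;
    my ($good, $bad) = (0, 0);

    while (1) {
        my $line1 = <$fh1>;
        my $line2 = <$fh2>;
        # Stop as soon as either file runs out, as the original loop does.
        last unless defined $line1 && defined $line2;

        # split ' ' drops leading whitespace and the trailing newline.
        my @words1 = split ' ', $line1;
        my @words2 = split ' ', $line2;

        my $last = @words1 > @words2 ? $#words1 : $#words2;
        for my $i (0 .. $last) {
            # The longer line has words with no partner (undef) here.
            no warnings 'uninitialized';
            ($words1[$i] eq $words2[$i] ? $good : $bad)++;
        }
    }
    return ($good, $bad);
}

# In-memory filehandles replace the real input files for this demo.
my ($text1, $text2) = ("one two three\nfour\n", "one TWO three\nfour five\n");
open my $in1, '<', \$text1 or die "open: $!";
open my $in2, '<', \$text2 or die "open: $!";
my ($good, $bad) = compare_words($in1, $in2);
print "good: $good\nbad: $bad\n";
```

For real files, the two `open` calls would take file names instead of scalar references, and an explicit encoding layer may matter if the diacritics are stored differently in the two files.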
Re^4: compare files by words by ysth (Canon) on May 31, 2007 at 10:02 UTC
Re^5: compare files by words by blazar (Canon) on May 31, 2007 at 10:30 UTC