Problems in comparing two files written in Japanese

Rishiraj has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I got some code (from the internet) for comparing two files and showing the differences in their contents. I then tried the same code on two files written in Japanese (kanji). If I save the two Japanese .txt files in ANSI format, it works fine, but if I save them in formats like 'UTF-8', 'Unicode', or 'Unicode big endian', it doesn't show the differences properly; it keeps showing odd symbols instead of the Japanese characters. I would be glad if someone could suggest a simple way of making it work for all formats (if that's possible). The code I am using is pasted below:

#!C:\perl\bin\perl.exe
# file_compare.pl
# Purpose: compare two files and show differences

use strict;
use warnings;

my $file1 ='E:\files\file_1.txt' or die "filename missing \n";
my $file2 = 'E:\files\file_2.txt' or die "filename missing \n";

open (FILE1, "< $file1") or die "Can not read file $file1: $! \n";
my @file1_contents = <FILE1>; # read entire contents of file
close (FILE1);

open (FILE2, "< $file2") or die "Can not read file $file2: $! \n";
my @file2_contents = <FILE2>; # read entire contents of file
close (FILE2);

my $length1 = $#file1_contents; # index of the last line in first file (line count minus one)
my $length2 = $#file2_contents; # index of the last line in second file (line count minus one)

if ($length1 > $length2) {
    # first file contains more lines than second file
    my $counter2 = 0;
    foreach my $line_file1 (@file1_contents) {
        chomp ($line_file1);

        if (defined ($file2_contents[$counter2])) {
            # line exists in second file
            chomp (my $line_file2 = $file2_contents[$counter2]);

            if ($line_file1 ne $line_file2) {
                print "\nline " . ($counter2 + 1) . " \n";
                print "< $line_file1 \n" if ($line_file1 ne "");
                print "--- \n";
                print "> $line_file2 \n\n" if ($line_file2 ne "");
            }
        }
        else {
            # there is no line in second file
            print "\nline " . ($counter2 + 1) . " \n";
            print "< $line_file1 \n" if ($line_file1 ne "");
            print "--- \n";
            print "> \n"; # this line does not exist in file2
        }
        $counter2++; # point to the next line in file2
    }
}
else {
    # second file contains more lines than first file
    # or both have equal number of lines
    my $counter1 = 0;
    foreach my $line_file2 (@file2_contents) {
        chomp ($line_file2);

        if (defined ($file1_contents[$counter1])) {
            # line exists in first file
            chomp (my $line_file1 = $file1_contents[$counter1]);

            if ($line_file1 ne $line_file2) {
                print "\nline " . ($counter1 + 1) . " \n";
                print "< $line_file1 \n" if ($line_file1 ne "");
                print "--- \n";
                print "> $line_file2 \n" if ($line_file2 ne "");
            }
        }
        else {
            # there is no line in first file
            print "\nline " . ($counter1 + 1) . " \n";
            print "< \n"; # this line does not exist in file1
            print "--- \n";
            print "> $line_file2 \n" if ($line_file2 ne "");
        }
        $counter1++; # point to next line in file1
    }
}

Thanks in advance for any help.


Replies are listed 'Best First'.
Re: Problems in comparing two files written in Japanese by CountZero (Bishop) on Jul 17, 2009 at 06:15 UTC
Rather than inventing your own diff algorithm, have a look at Algorithm::Diff.

"it doesn't show the differences properly ... keeps showing odd symbols instead of the Japanese characters"

I do not think this has anything to do with your diff algorithm. More likely your output device is not working with the right character set, or the font in use does not have the correct representation of some characters. See `perldoc perlunicode` and `perldoc perluniintro`.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
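To illustrate the Algorithm::Diff suggestion, here is a minimal sketch using its `sdiff` function. It assumes the module has been installed from CPAN (it is not part of core Perl); the helper name `changed_rows` and the sample lines are made up for this example. The point it shows is that one inserted line is reported as a single addition, whereas an index-by-index comparison would flag every line after it.

```perl
use strict;
use warnings;
use Algorithm::Diff qw(sdiff);   # CPAN module, assumed to be installed

# Hypothetical helper: keep only the rows of sdiff() that are not unchanged.
# Each row is [flag, left, right]; the flag is 'u' (unchanged), 'c' (changed),
# '-' (only in the first list) or '+' (only in the second list).
sub changed_rows {
    my ($old, $new) = @_;
    return grep { $_->[0] ne 'u' } sdiff($old, $new);
}

# Made-up sample data: the second list has one extra line in the middle.
my @old = ('line one', 'line two', 'line three');
my @new = ('line one', 'inserted line', 'line two', 'line three');

for my $row (changed_rows(\@old, \@new)) {
    my ($flag, $left, $right) = @$row;
    print "$flag\n< $left\n---\n> $right\n";
}
```

The lists passed in would normally be the chomped lines of the two files; how they are decoded and displayed is a separate question from the diff itself.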
Re: Problems in comparing two files written in Japanese by graff (Chancellor) on Jul 17, 2009 at 11:58 UTC
As CountZero says, when you change how the characters are encoded, you have to change your method of displaying the text in order to see the characters correctly. Or, since changing the character encoding should be a "lossless" operation, change the encoding back to whatever form can be displayed correctly.

In terms of just comparing the contents of two text files, so long as both are encoded the same way, the comparison process doesn't need to care what the character encoding is: if two lines have the same sequence of byte values, they are the same; otherwise, they are different. (Obviously, a byte-wise comparison of two files of Japanese text that use two different encodings would be pointless; they would have nothing in common.)

But if you want your perl script to do anything in terms of characters (as opposed to just sequences of byte values), you need to specify how the file data is encoded, so that perl can convert the data to its own internal utf8 form and treat it as characters. The best way is via the "mode" argument on the open() call:

`open (FILE1, "<:encoding(UTF-16BE)", $file1) or die "Can not read file $file1: $! \n";`

(or whatever the encoding may be for the particular file). Note that with this technique, your perl script can read data from a file that was encoded one way and output the data in some other encoding, simply by setting the encoding of the output file handle (or doing `binmode(STDOUT, ":encoding(...)");` for printing to STDOUT).

Hint: instead of hard-coding file names and encodings, use command-line args and get these values from @ARGV.

One other point: in order to do line-oriented reads and operations on UTF-16 input files, you may need to adjust $/, because each line-feed byte will need to have a null byte either before or after it, depending on the byte order of the UTF-16 data.
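Putting those suggestions together, a minimal sketch might look like the following. It is one possible arrangement, not the only one: the sub names `read_lines` and `compare_lines` and the argument order are invented for this example, and the encoding names are assumed to be ones the Encode module knows (for instance `UTF-8`, `UTF-16LE`, `UTF-16BE`, or `cp932` for Japanese Windows "ANSI"). Because each file is decoded to characters on input, the two files do not even need to share an encoding.

```perl
#!/usr/bin/perl
# Sketch only. Assumed usage: perl this_script.pl FILE1 ENCODING1 FILE2 ENCODING2
use strict;
use warnings;

# Read a file through an :encoding() layer so that Perl sees characters.
# :raw comes first so no platform newline translation touches the raw bytes.
sub read_lines {
    my ($path, $encoding) = @_;
    open(my $fh, "<:raw:encoding($encoding)", $path)
        or die "Cannot read file $path: $!\n";
    my @lines = <$fh>;
    close($fh);
    s/\r?\n\z// for @lines;                     # drop LF or CRLF line endings
    $lines[0] =~ s/\A\x{FEFF}// if @lines;      # drop a byte-order mark if one was kept
    return @lines;
}

# Compare by position; return [line_number, left, right] for each mismatch.
# A line missing from the shorter file is reported as an empty string.
sub compare_lines {
    my ($first, $second) = @_;
    my $max = @$first > @$second ? scalar @$first : scalar @$second;
    my @diffs;
    for my $i (0 .. $max - 1) {
        my $left  = $first->[$i];    # undef past the end of the shorter list
        my $right = $second->[$i];
        next if defined $left && defined $right && $left eq $right;
        push @diffs, [ $i + 1, $left // '', $right // '' ];
    }
    return @diffs;
}

# Run the comparison only when all four arguments were supplied.
if (@ARGV == 4) {
    my ($file1, $enc1, $file2, $enc2) = @ARGV;
    binmode(STDOUT, ':encoding(UTF-8)');        # emit the report as UTF-8
    my @first  = read_lines($file1, $enc1);
    my @second = read_lines($file2, $enc2);
    for my $diff (compare_lines(\@first, \@second)) {
        my ($number, $left, $right) = @$diff;
        print "\nline $number\n< $left\n---\n> $right\n";
    }
}
```

Whether the UTF-8 report then displays correctly still depends on the terminal's code page and font, as noted earlier in the thread; redirecting the output to a file and opening it in a Unicode-aware editor sidesteps that.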