Rishiraj has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I got some code (from the internet) for comparing two files and showing the differences in their contents. I then tried the same code on two files written in Japanese (kanji). If I save the two Japanese .txt files in ANSI format, it works fine, but if I save them in formats like 'UTF-8', 'Unicode', or 'Unicode big endian', it doesn't show the differences properly: it keeps showing odd symbols instead of the Japanese characters. I would be glad if someone could suggest a simple way of making it work for all formats (if that's possible). The code I am using is pasted below:
#!C:\perl\bin\perl.exe
# file_compare.pl
# Purpose: compare two files and show differences
use strict;
use warnings;

my $file1 = 'E:\files\file_1.txt' or die "filename missing \n";
my $file2 = 'E:\files\file_2.txt' or die "filename missing \n";

open (FILE1, "< $file1") or die "Can not read file $file1: $! \n";
my @file1_contents = <FILE1>; # read entire contents of file
close (FILE1);

open (FILE2, "< $file2") or die "Can not read file $file2: $! \n";
my @file2_contents = <FILE2>; # read entire contents of file
close (FILE2);

my $length1 = $#file1_contents; # number of lines in first file
my $length2 = $#file2_contents; # number of lines in second file

if ($length1 > $length2) {
    # first file contains more lines than second file
    my $counter2 = 0;
    foreach my $line_file1 (@file1_contents) {
        chomp ($line_file1);
        if (defined ($file2_contents[$counter2])) {
            # line exists in second file
            chomp (my $line_file2 = $file2_contents[$counter2]);
            if ($line_file1 ne $line_file2) {
                print "\nline " . ($counter2 + 1) . " \n";
                print "< $line_file1 \n" if ($line_file1 ne "");
                print "--- \n";
                print "> $line_file2 \n\n" if ($line_file2 ne "");
            }
        }
        else {
            # there is no line in second file
            print "\nline " . ($counter2 + 1) . " \n";
            print "< $line_file1 \n" if ($line_file1 ne "");
            print "--- \n";
            print "> \n"; # this line does not exist in file2
        }
        $counter2++; # point to the next line in file2
    }
}
else {
    # second file contains more lines than first file
    # or both have equal number of lines
    my $counter1 = 0;
    foreach my $line_file2 (@file2_contents) {
        chomp ($line_file2);
        if (defined ($file1_contents[$counter1])) {
            # line exists in first file
            chomp (my $line_file1 = $file1_contents[$counter1]);
            if ($line_file1 ne $line_file2) {
                print "\nline " . ($counter1 + 1) . " \n";
                print "< $line_file1 \n" if ($line_file1 ne "");
                print "--- \n";
                print "> $line_file2 \n" if ($line_file2 ne "");
            }
        }
        else {
            # there is no line in first file
            print "\nline " . ($counter1 + 1) . " \n";
            print "< \n"; # this line does not exist in file1
            print "--- \n";
            print "> $line_file2 \n" if ($line_file2 ne "");
        }
        $counter1++; # point to next line in file1
    }
}
Thanks in advance for any help.

Re: Problems in comparing two files written in Japanese
by CountZero (Bishop) on Jul 17, 2009 at 06:15 UTC
    Rather than inventing your own Diff-algorithm, have a look at Algorithm::Diff.

    it doesn't show the differences properly....keeps showing odd symbols instead of the japanese characters
    I do not think it has anything to do with your diff algorithm. More likely, your output device is not using the right character set, or the font in use lacks the correct glyphs for some characters. See perldoc perlunicode and perldoc perluniintro.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Problems in comparing two files written in Japanese
by graff (Chancellor) on Jul 17, 2009 at 11:58 UTC
    As CountZero says, when you change how the characters are encoded, you have to change your method of displaying the text in order to see the characters correctly. Or, since changing the character encoding should be a "lossless" operation, change the encoding back to whatever form can be displayed correctly.

    In terms of just comparing the contents of two text files, so long as both are encoded the same way, the comparison process doesn't need to care what the character encoding is -- if two lines have the same sequence of binary byte values, they are the same, otherwise, they are different. (Obviously, comparing two files of Japanese text that use two different encodings would be pointless -- they would have nothing in common.)
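The byte-level point above can be checked with a small sketch. It assumes the core Encode module; the helper name `lines_differ` and the sample string are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: compare two lines purely as byte strings.
sub lines_differ {
    my ($bytes1, $bytes2) = @_;
    return $bytes1 ne $bytes2 ? 1 : 0;
}

# Three kanji written as escapes, so this source file stays plain ASCII.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

# Same text, same encoding: the byte sequences are identical.
my $utf8_a = encode('UTF-8', $text);
my $utf8_b = encode('UTF-8', $text);

# Same text, different encoding: the byte sequences are not the same.
my $utf16 = encode('UTF-16BE', $text);

print "same encoding:      ",
    (lines_differ($utf8_a, $utf8_b) ? "differ" : "equal"), "\n";
print "different encoding: ",
    (lines_differ($utf8_a, $utf16) ? "differ" : "equal"), "\n";
```

A plain `ne` on undecoded data is therefore enough for an equal/different verdict, as long as both files were saved in the same encoding.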

    But if you want your perl script to do anything in terms of characters (as opposed to just sequences of binary byte values), you need to specify how the file data is encoded, so that perl can convert the data to its own internal utf8 form and treat it as characters. The best way is via the "mode" argument on the open() call:

    open (FILE1, "<:encoding(UTF-16BE)", $file1) or die "Can not read file $file1: $! \n";
    (or whatever the encoding may be for the particular file). Note that with this technique, your perl script can read data from a file that was encoded one way and output the data in some other encoding, simply by setting the encoding of the output file handle (or by doing binmode(STDOUT, ":encoding(...)"); for printing to STDOUT).
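As a rough sketch of reading in one encoding and writing in another: the helper name `transcode_file`, the temporary file names, and the choice of UTF-16BE in and UTF-8 out are all assumptions made for the illustration. The `:raw` layer is added on the assumption that newline translation should not touch the UTF-16 bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $in_file as $in_enc, write it to $out_file as $out_enc.
sub transcode_file {
    my ($in_file, $in_enc, $out_file, $out_enc) = @_;

    open(my $in, "<:raw:encoding($in_enc)", $in_file)
        or die "Can not read file $in_file: $!\n";
    open(my $out, ">:raw:encoding($out_enc)", $out_file)
        or die "Can not write file $out_file: $!\n";

    # Each line is decoded to characters on input and re-encoded on output.
    while (my $line = <$in>) {
        print {$out} $line;
    }

    close($in);
    close($out) or die "Can not close file $out_file: $!\n";
    return;
}

# Build a small UTF-16BE sample file in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/sample_utf16be.txt";
my $dst = "$dir/sample_utf8.txt";

open(my $fh, ">:raw:encoding(UTF-16BE)", $src)
    or die "Can not write file $src: $!\n";
print {$fh} "\x{65E5}\x{672C}\n";    # two kanji and a newline
close($fh) or die "Can not close file $src: $!\n";

transcode_file($src, 'UTF-16BE', $dst, 'UTF-8');

printf "%s: %d bytes\n", $src, -s $src;
printf "%s: %d bytes\n", $dst, -s $dst;
```

For printing to the screen rather than to a file, the same idea would be `binmode(STDOUT, ":encoding(UTF-8)")`, with the encoding name replaced by whatever the terminal actually expects.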

    Hint: instead of hard-coding file names and encodings, use command-line args and get these values from @ARGV.
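One possible shape for that, as a sketch only: the argument order (file1, file2, optional encoding), the helper name `parse_args`, and the UTF-8 default are assumptions, not something the original script defines.

```perl
use strict;
use warnings;

# Hypothetical argument layout: file1 file2 [encoding].
sub parse_args {
    my @args = @_;
    die "Usage: $0 file1 file2 [encoding]\n" if @args < 2 || @args > 3;

    my ($file1, $file2, $encoding) = @args;
    $encoding = 'UTF-8' unless defined $encoding;    # assumed default

    return (file1 => $file1, file2 => $file2, encoding => $encoding);
}

# Only act when arguments were actually supplied on the command line.
if (@ARGV) {
    my %opt = parse_args(@ARGV);
    print "Comparing $opt{file1} and $opt{file2} as $opt{encoding}\n";
}
```

It might be invoked as `perl file_compare.pl file_1.txt file_2.txt UTF-16BE`, and the encoding value can then be interpolated into the mode argument of open().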

    One other point: in order to do line-oriented reads and operations on UTF-16 input files, you may need to adjust $/, because each line-feed byte will need to have a null byte either before or after it, depending on the byte order of the UTF-16 data.
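A minimal sketch of that adjustment, assuming the file is read as raw bytes and decoded line by line with the core Encode module. The helper name `read_utf16_lines_raw` and the sample data are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16 file as raw bytes, one line at a time.
# $byte_order is 'BE' or 'LE'.
sub read_utf16_lines_raw {
    my ($file, $byte_order) = @_;

    # In UTF-16 a line feed is 00 0A (big-endian) or 0A 00 (little-endian).
    local $/ = $byte_order eq 'BE' ? "\x00\x0A" : "\x0A\x00";

    open(my $fh, '<:raw', $file) or die "Can not read file $file: $!\n";
    my @lines;
    while (my $bytes = <$fh>) {
        chomp $bytes;    # strips the two-byte separator held in $/
        push @lines, decode("UTF-16$byte_order", $bytes);
    }
    close($fh);
    return @lines;
}

# Write a two-line little-endian sample to a temporary file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} encode('UTF-16LE', "\x{65E5}\x{672C}\n\x{8A9E}\n");
close($tmp_fh) or die "Can not close file $tmp_name: $!\n";

my @lines = read_utf16_lines_raw($tmp_name, 'LE');
printf "read %d lines\n", scalar @lines;
```

Two caveats apply to this sketch. A byte-level separator can in principle match across a character boundary (for example, in big-endian data, a character ending in a 00 byte followed by one starting with 0A), and a byte-order mark or carriage return in the file would remain in the decoded lines. When the handle is opened with an :encoding(...) layer instead, lines are split after decoding, so the default $/ is typically sufficient.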