in reply to Problems in comparing two files written in Japanese
In terms of just comparing the contents of two text files, so long as both are encoded the same way, the comparison process doesn't need to care what the character encoding is -- if two lines have the same sequence of binary byte values, they are the same, otherwise, they are different. (Obviously, comparing two files of Japanese text that use two different encodings would be pointless -- they would have nothing in common.)
But if you want your perl script to do anything in terms of characters (as opposed to just sequences of binary byte values), you need to specify how the file data is encoded, so that perl can convert the data to its own internal utf8 form and treat it as characters. The best way is via the "mode" argument on the open() call:

open(FILE1, "<:encoding(UTF-16BE)", $file1) or die "Cannot read file $file1: $!\n";

(or whatever the encoding may be for the particular file). Note that with this technique, your perl script can read data from a file that was encoded one way, and output the data in some other encoding, simply by setting the encoding of the output file handle (or doing binmode(STDOUT, ":encoding(...)"); for printing to STDOUT).
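For example, here is a minimal, self-contained sketch of that transcoding idea (the file names are made up for the demo): it first writes a small UTF-16BE sample so the script can run on its own, then reads it back through an :encoding() layer and writes it out as UTF-8.

```perl
use strict;
use warnings;

# Create a small UTF-16BE sample file so this sketch is self-contained.
open(my $mk, '>:encoding(UTF-16BE)', 'sample_utf16be.txt')
    or die "Cannot write sample_utf16be.txt: $!\n";
print $mk "\x{65E5}\x{672C}\x{8A9E}\n";   # "nihongo" (Japanese) in kanji
close $mk;

# Read UTF-16BE, write UTF-8: the :encoding() layers do all the conversion.
open(my $in,  '<:encoding(UTF-16BE)', 'sample_utf16be.txt')
    or die "Cannot read sample_utf16be.txt: $!\n";
open(my $out, '>:encoding(UTF-8)',    'sample_utf8.txt')
    or die "Cannot write sample_utf8.txt: $!\n";
print $out $_ while <$in>;   # each line is re-encoded on output
close $in;
close $out;
```

The same pattern works with binmode(STDOUT, ":encoding(UTF-8)") if you want the converted text on standard output instead of in a file.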
Hint: instead of hard-coding file names and encodings, use command-line args and get these values from @ARGV.
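A sketch of that hint (the invocation shown in the comment is hypothetical; a default encoding is supplied so the fragment also runs with no arguments):

```perl
use strict;
use warnings;

# Hypothetical invocation:  perl compare.pl UTF-16BE file1.txt file2.txt
my ($enc, $file1, $file2) = @ARGV;
$enc //= 'UTF-8';   # fall back to a default so the sketch runs without args

print "encoding=$enc\n";
if (defined $file1 && defined $file2) {
    open(my $fh1, "<:encoding($enc)", $file1) or die "Cannot read $file1: $!\n";
    open(my $fh2, "<:encoding($enc)", $file2) or die "Cannot read $file2: $!\n";
    # ... line-by-line comparison of $fh1 and $fh2 would go here ...
}
```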
One other point: if you read a UTF-16 file in raw byte mode (that is, without an :encoding() layer to decode it), line-oriented reads will need an adjusted $/, because each line-feed byte is paired with a null byte -- before or after it, depending on the byte order of the UTF-16 data. When the :encoding() layer is in place, perl decodes the data first, and the default $/ works as usual.
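To illustrate the raw-byte case, here is a small self-contained sketch (the file name and byte content are made up for the demo): it writes a few UTF-16BE-style byte pairs, then reads them back in :raw mode with $/ set to the two-byte big-endian line terminator.

```perl
use strict;
use warnings;

# Write two "lines" of raw UTF-16BE bytes: the char U+65E5 (bytes 65 E5)
# followed by a big-endian newline (bytes 00 0A).
open(my $mk, '>:raw', 'raw_utf16be.txt') or die "Cannot write: $!\n";
print $mk "\x65\xE5\x00\x0A" x 2;
close $mk;

# In raw byte mode the UTF-16BE line terminator is "\x00\x0A"
# (for UTF-16LE it would be "\x0A\x00" -- the null byte follows).
open(my $fh, '<:raw', 'raw_utf16be.txt') or die "Cannot read: $!\n";
local $/ = "\x00\x0A";    # big-endian: null byte precedes the line feed
my @lines = <$fh>;
close $fh;
printf "read %d lines\n", scalar @lines;
```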