CJK / shiftjis

javadba has asked for the wisdom of the Perl Monks concerning the following question:

(caveat pre-emptor: I am a perl newbie..)

We get CJK errors when reading a text file. The file does contain Japanese content, so this is not surprising. We need to support both UTF-8 and Shift-JIS. Can I use a layered approach? Here are the code and the error.

Current code:

  open(INFD,  '<:utf8',  $inFile)  or die "Cannot open  $inFile: $!";

Current Error:

  ERROR:  invalid byte sequence for encoding "UTF8": 0xe0b8
  HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

One Perl site recommended layered approach:

  open(INFD,  '<:utf8',  $inFile)  or die "Cannot open $inFile: $!";
  open(INFD,  '<:shiftjis',  $inFile)  or die "Cannot open $inFile: $!";

Is this the way to go? Other ideas?


Re: CJK / shiftjis by JavaFan (Canon) on Dec 14, 2010 at 15:27 UTC
I have little experience with encodings, but if the file contains Shift-JIS data, opening it with the utf8 layer is wrong. Opening it first with the utf8 layer and then with shiftjis is pointless: the second open() replaces the first, so the first layer won't stick. There is a syntax for opening a file with multiple layers at once, but I don't remember it off-hand, and I don't think it would be correct for your problem anyway.

You write "We need support for UTF8 and shiftjis". Do you mean your file contains both UTF-8 and Shift-JIS data? I do not think any of the encoding layers will help you there: a layer is not smart enough to decide which bytes should be treated as UTF-8 and which as Shift-JIS. You'll have to open the file in binary mode, read it in chunks that are each in a single encoding, and then do whatever you need to do.
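
A minimal sketch of the binary-mode approach described above, assuming that each line of the file is entirely in one encoding. The helper name `decode_chunk` and the sample data are made up for illustration. Each chunk is tried as strict UTF-8 first, with Shift-JIS as the fallback, because Shift-JIS bytes for Japanese text are very rarely valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper: decode one chunk of raw bytes, trying strict UTF-8
# first and falling back to Shift-JIS. Assumes the whole chunk uses a
# single encoding.
sub decode_chunk {
    my ($bytes) = @_;
    for my $encoding ('UTF-8', 'shiftjis') {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $text = eval { decode($encoding, $copy, Encode::FB_CROAK) };
        return ($text, $encoding) if defined $text;
    }
    die "Chunk is neither valid UTF-8 nor valid Shift-JIS\n";
}

# Sample data so the sketch is runnable: the same word, one line per encoding.
my $word = "\x{65E5}\x{672C}\x{8A9E}";
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} encode('UTF-8', $word), "\n", encode('shiftjis', $word), "\n";
close($out) or die "Cannot close $path: $!";

# Open in binary mode and decode line by line.
open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
my @decoded;
while (my $line = <$fh>) {
    chomp $line;
    push @decoded, [ decode_chunk($line) ];    # [ text, encoding used ]
}
close($fh);
```

If a file is in a single known encoding, none of this is needed. Perl spells encoding layers as `:encoding(NAME)`, for example `'<:encoding(shiftjis)'`; a bare `:shiftjis` is not a layer Perl knows about.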
Re: CJK / shiftjis by Anonyrnous Monk (Hermit) on Dec 14, 2010 at 15:57 UTC
"We need support for UTF8 and shiftjis"

There's Encode::Guess for when you need to determine automatically which of several encodings a file has. It sometimes even guesses correctly, in particular if you only need to distinguish between two encodings whose representations aren't too similar. But don't expect miracles. Other than that, specifying the appropriate encoding layer for `open()` should work just fine. Have you tried `shiftjis` (instead of `utf8`) for the file where `utf8` failed?
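
A rough sketch of how Encode::Guess might be used here. The helper name `guess_and_decode` and the sample word are invented for the example. It assumes the data is already in memory as raw bytes and that Shift-JIS is the only suspect needed beyond the module's defaults (ASCII, UTF-8 and BOM-marked UTF-16/32).

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: guess between the default suspects plus Shift-JIS,
# then decode the bytes with whichever encoding was picked.
sub guess_and_decode {
    my ($bytes) = @_;
    my $guess = guess_encoding($bytes, 'shiftjis');
    # On failure guess_encoding() returns an error string, not an object.
    ref($guess) or die "Cannot guess encoding: $guess\n";
    return ($guess->decode($bytes), $guess->name);
}

# Sample input: a Japanese word encoded as Shift-JIS bytes.
my $word = "\x{65E5}\x{672C}\x{8A9E}";
my ($text, $name) = guess_and_decode(encode('shiftjis', $word));
print "Guessed encoding: $name\n";
```

When more than one suspect fits the data, `guess_encoding()` returns a message naming the candidates instead of an encoding object, so the `ref()` check matters in practice.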