If the data really is in Japanese, then Encode::Guess has a very good chance of identifying exactly which encoding is being used. The common Japanese encodings are structurally distinct enough from each other that the logic for telling them apart can be quite reliable.
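As a sketch of how Encode::Guess is typically used (the suspect list here assumes the data can only be one of the usual Japanese encodings; the sample bytes are Shift-JIS for "日本" followed by an ideographic comma):

```perl
use strict;
use warnings;
use Encode::Guess qw(euc-jp shiftjis 7bit-jis);  # candidate Japanese encodings

# Shift-JIS (cp932) bytes: "日本" followed by an ideographic comma (0x8141)
my $raw = "\x93\xFA\x96\x7B\x81\x41";

my $decoder = Encode::Guess->guess($raw);
# On failure or ambiguity, guess() returns an error string, not an object
die "Could not guess encoding: $decoder\n" unless ref $decoder;
printf "Guessed: %s\n", $decoder->name;

my $text = $decoder->decode($raw);  # now a Perl character string
```

Note that the shorter the suspect list, the more reliable the guess; listing many overlapping encodings makes ambiguous results more likely.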
For that matter, I can easily look at the standard Unicode-to-non-Unicode mapping tables (available from http://www.unicode.org/Public/MAPPINGS/) and see that there is only one non-Unicode encoding where 0x8141 maps to U+3001 "IDEOGRAPHIC COMMA" -- and that happens to be cp932.
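You can confirm that mapping without leaving Perl, since Encode ships with the cp932 tables:

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0x8141 in cp932 should decode to U+3001 (IDEOGRAPHIC COMMA)
my $ch = decode("cp932", "\x81\x41");
printf "U+%04X\n", ord($ch);   # prints U+3001
```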
In any case, the one thing you DO NOT want to do is anything like this on a "raw" string:
split( /\x81\x41/, $txt );
That's because there is a reasonable chance that this 2-byte sequence could occur such that the "\x81" is actually the second byte of some other two-byte character, rather than being the first byte of a "wide comma". The result will be that you split in the middle of a wide character, and the data you get will be trashed. (I know this from personal experience -- Perl 5.8 was a God-send for me.)
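Here is a minimal demonstration of that failure mode. The sample assumes a two-byte cp932 character whose trail byte happens to be 0x81 (0x8181), followed by a plain ASCII "A" -- a string that contains no ideographic comma at all:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One 2-byte cp932 character (trail byte 0x81), then ASCII "A".
# There is no ideographic comma anywhere in this string.
my $raw = "\x81\x81\x41";

# Naive byte-level split: the pattern matches across the character
# boundary, leaving a lone lead byte behind -- the data is trashed.
my @naive = split /\x81\x41/, $raw;
print scalar(@naive), ": ", unpack("H*", $naive[0]), "\n";   # 1: 81

# Decode first, then split on the code point: no false match at all.
my @safe = split /\x{3001}/, decode("cp932", $raw);
# @safe holds the whole (comma-free) two-character string in one piece
```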
Find out (or figure out) what the encoding really is, use Encode to convert the bytes to a Perl character string, find the Unicode code point for your comma character, and split on that. Assuming my deduction about cp932 is correct, something like this will do the right thing:
split /\x{3001}/, decode( "cp932", $txt );
No possibility of "false-alarm" (mis)matches that way. You can easily convert back to cp932 for output if you want, but any string manipulation within your Perl script is best done on decoded character data.
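Putting the whole round trip together (the sample bytes are an assumption: cp932 for "日本", an ideographic comma, then "語"):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp932 bytes for "日本、語" (ideographic comma between 日本 and 語)
my $raw = "\x93\xFA\x96\x7B\x81\x41\x8C\xEA";

# Decode at the boundary, manipulate characters, encode on the way out.
my @fields = split /\x{3001}/, decode("cp932", $raw);
my @out    = map { encode("cp932", $_) } @fields;

print scalar(@fields), "\n";   # 2
```

The pattern is the usual one: decode as early as possible, encode as late as possible, and keep everything in between as character data.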