arcnon has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse what I am told is Japanese. After looking at it with a hex editor, the comma (which looks like a baseline tick) shows up as '81 41', so I am thinking I'll just split on it. Am I stumbling down the right path with the following? Or is there a better method?
split(/\x81\x41/, $txt);

Replies are listed 'Best First'.
Re: parsing non english
by graff (Chancellor) on Nov 09, 2007 at 03:19 UTC
    If the data really is in Japanese, then Encode::Guess has a very good chance of figuring out exactly what encoding is being used. The various possible encodings are sufficiently distinct from one another that the logic for telling them apart can be quite reliable.
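    For example, a minimal sketch (the candidate list here is my assumption; feed it whatever encodings you actually suspect):

    use Encode::Guess;

    # Guessing is most reliable with a short list of suspects.
    my $enc = guess_encoding( $txt, qw( shiftjis euc-jp 7bit-jis ) );
    ref $enc or die "Could not guess the encoding: $enc";
    print "Looks like ", $enc->name, "\n";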

    For that matter, I can easily look at the standard unicode-to-nonunicode mapping tables (available from http://www.unicode.org/Public/MAPPINGS/) and see that there is only one non-unicode encoding where 0x8141 maps to U+3001 "IDEOGRAPHIC COMMA" -- and that happens to be cp932. (updated to make the unicode.org link more specific)

    In any case, the one thing you DO NOT want to do is anything like this on a "raw" string:

    split( /\x81\x41/, $txt );
    That's because there is a reasonable chance that this 2-byte sequence could occur such that the "\x81" is actually the second byte of some other two-byte character, rather than being the first byte of a "wide comma". The result will be that you split in the middle of a wide character, and the data you get will be trashed. (I know this from personal experience -- Perl 5.8 was a God-send for me.)
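    Here is a sketch of that failure mode (assuming, per the cp932 mapping table, that 0x8181 encodes U+FF1D FULLWIDTH EQUALS SIGN):

    use Encode qw( decode );

    # cp932 bytes: 0x81 0x81 is one two-byte character (a fullwidth '='),
    # and 0x41 is a plain ASCII 'A'.
    my $raw = "\x81\x81\x41";

    # The byte-level split matches the trail byte 0x81 plus the 'A',
    # cutting the two-byte character in half and leaving a lone lead byte:
    my @broken = split /\x81\x41/, $raw;    # ( "\x81" )

    # Decoding first preserves character boundaries, so there is no false match:
    my @ok = split /\x{3001}/, decode( "cp932", $raw );    # ( "\x{FF1D}A" )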

    Find out (or figure out) what the encoding really is, use Encode to convert it to a utf8 string, find out the unicode code point for your comma character, and split on that. Assuming my deduction about cp932 is correct, something like this will do the right thing:

    split /\x{3001}/, decode( "cp932", $txt );
    (updated to fix a typo in the charset name)

    No possibility of "false-alarm" (mis)matches that way. You can easily convert back to cp932 for output if you want, but any string manipulation within your Perl script is best done on utf8 data.
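    Put together, the round trip might look like this sketch (assuming the data really is cp932; the field handling is illustrative):

    use Encode qw( decode encode );

    my $chars  = decode( "cp932", $txt );      # cp932 bytes -> Perl characters
    my @fields = split /\x{3001}/, $chars;     # split on the ideographic comma

    # ... work with @fields as character strings ...

    my $out = encode( "cp932", join( "\x{3001}", @fields ) );   # back to bytes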

Re: parsing non english
by moritz (Cardinal) on Nov 08, 2007 at 16:49 UTC
    Since you mention a hex editor, I guess you are having problems with the charset -- is that correct so far?

    Have you found any editor that can open and display your files correctly? Where does the data come from?

    In the meantime, read perluniintro.

      It comes from an Access database. Some Japanese fellows translated some information for a doctor, but they placed all the translated names in one field... It has fallen to me to break it up and insert it into a new database.
      Being a lazy American, I can barely speak English. I didn't load any foreign charsets, so I assume I am not seeing a true representation.
      Honestly, I don't have the slightest idea whether this info is Unicode.
      I'm just guessing at the comma character based on what I was told it was... then viewing that character in a hex editor.
        Well, first you have to find out the encoding. Otherwise the data is just binary garbage to you and your programs.

        I'd suggest asking the people who produced the data.

        There are a few other possibilities; for example, the text editor vim has decent charset autodetection.

        You can also try Encode::Guess, but you have to provide it with a list of possible encodings. Try to find out which encodings are used in Japan on Windows.

        Once you know the charset, you can decode it (with decode from the Encode module) and work with it, as in the sketch below.
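        For instance (the suspect list is my guess -- cp932/Shift_JIS is the usual Windows choice for Japanese, euc-jp the usual Unix one):

        use Encode::Guess;

        my $enc = guess_encoding( $data, qw( cp932 euc-jp 7bit-jis ) );
        ref $enc or die "Could not guess the encoding: $enc";
        my $chars = $enc->decode( $data );   # now a character string you can work with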