GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks: Quick question. Having a nightmare w/ storing EUC-JP in a certain open source database, so I'm storing them in Unicode. Actually encoding the EUC to utf8 is a breeze.
$string = decode("euc-jp", $string);
Below, AFAIK, is supposed to encode a utf8 string into EUC. All I'm getting is mojibake though. Am I missing something?
$string = encode('euc-jp', $string);
It's a bit late, and not sure how long I'll be up. Pardon me if I don't immediately send praises for replies.

UPDATE: Well, I got this to work. Still curious about Encode::JP though.
use Unicode::Japanese; $string = Unicode::Japanese->new($string)->euc;

Re: Help with Encode::JP
by ikegami (Patriarch) on Sep 18, 2006 at 14:57 UTC

    encoding the EUC to utf8 is a breeze.

    $string = decode("euc-jp", $string);

    Something's wrong when *decode* is used to *encode*.

    UNICODE is a character set, not an encoding. You cannot store something in UNICODE format (as you claim), since there is no such thing. UNICODE characters need to be encoded in order to be stored. utf8 is a particularly convenient *encoding* for storing UNICODE characters.

    Perl strings work in a similar way. Perl actually has two kinds of strings: strings of bytes and strings of characters. Encoded strings (euc-jp, utf8, etc) are strings of bytes, whereas strings of characters cannot be stored without first being encoded.
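
    A minimal sketch of that distinction, assuming the two-character string 日本 arrives as raw euc-jp bytes. The sample bytes are an assumption chosen for illustration; the variable names follow the ones used in this reply.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # 日本 (two characters) as raw euc-jp bytes; illustrative sample data
    my $jp_bytes = "\xC6\xFC\xCB\xDC";

    # Bytes -> characters: the result is a character string
    my $char_string = decode("euc-jp", $jp_bytes);

    # Characters -> bytes again, this time in a different encoding
    my $utf8_bytes = encode("UTF-8", $char_string);

    # length() counts bytes for byte strings and characters for character strings
    printf "euc-jp bytes: %d, characters: %d, utf8 bytes: %d\n",
        length($jp_bytes), length($char_string), length($utf8_bytes);
    ```

    The same two characters take four bytes in euc-jp and six in utf8, while the decoded character string has length 2.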

    If you wish to store a euc-jp string as utf8, you wish to convert from one encoding to another.

    $char_string = decode("euc-jp", $jp_bytes);
    $utf8_bytes  = encode("utf8", $char_string);

    or

    $utf8_bytes = encode("utf8", decode("euc-jp", $jp_bytes));

    or

    $utf8_bytes = $jp_bytes;
    from_to($utf8_bytes, "euc-jp", "utf8");  # converts in place; the return value is a length, not the string

    or

    $char_string = decode("euc-jp", $jp_bytes);
    binmode FH, ":encoding(utf8)";
    print FH $char_string;
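
    One detail about from_to is easy to trip over: per the Encode documentation it rewrites its first argument in place and returns the length of the result in octets (undef on failure), so its return value is not the converted string. A small sketch, with sample bytes (日本 in euc-jp) and the name $len assumed for illustration:

    ```perl
    use strict;
    use warnings;
    use Encode qw(from_to);

    # 日本 as raw euc-jp bytes; illustrative sample data
    my $bytes = "\xC6\xFC\xCB\xDC";

    # from_to modifies $bytes in place and returns the new length in octets
    my $len = from_to($bytes, "euc-jp", "utf8");

    defined $len or die "conversion failed";
    printf "converted in place, now %d octets\n", $len;
    ```

    Work on a copy if the original euc-jp bytes are still needed afterwards.
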
      $char_string = decode("euc-jp", $jp_bytes); $utf8_bytes = encode("utf8", $char_string);

      No, I think the OP's issue is that he's having trouble converting the utf8 encoding (as stored to and fetched from the database) back into "euc-jp", because (for example), maybe that's how the data needs to be displayed.

      Converting to utf8_bytes is kind of useless in this context.

        Sounds to me like he wants to do both. I assumed he could figure out that all he needs to do to go in the other direction is to swap euc-jp and utf8.

        $char_string = decode("utf8", $utf8_bytes);
        $jp_bytes    = encode("euc-jp", $char_string);

        or

        $jp_bytes = encode("euc-jp", decode("utf8", $utf8_bytes));

        or

        $jp_bytes = $utf8_bytes;
        from_to($jp_bytes, "utf8", "euc-jp");  # converts in place; the return value is a length, not the string

        or

        $char_string = decode("utf8", $utf8_bytes);
        binmode FH, ":encoding(euc-jp)";
        print FH $char_string;
Re: Help with Encode::JP
by graff (Chancellor) on Sep 19, 2006 at 07:50 UTC
    So, you have converted the original data from euc-jp to utf8, you've stored that in the database, you've fetched the utf8 data back from the database, and now you want to convert it back to euc-jp, but using the simple "encode" call for that last step was a bust. Did I get that right?

    If so, the problem is probably that when you get the data back from the database, perl doesn't know that it's utf8 anymore, and the encode function will do the wrong thing as a result.
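
    A sketch of what that failure might look like, assuming the database hands back plain utf8 octets. The sample data (日本) and the names $wrong and $right are assumptions for illustration, not taken from the OP's code.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # utf8 octets for 日本, as they might come back from the database
    my $db_octets = "\xE6\x97\xA5\xE6\x9C\xAC";

    # Wrong: encode() expects characters, so each octet is treated as a
    # separate Latin-1 character and the output is mojibake
    my $wrong = encode("euc-jp", $db_octets);

    # Right: turn the octets into characters first, then encode those
    my $right = encode("euc-jp", decode("UTF-8", $db_octets));

    print "results differ\n" if $wrong ne $right;
    ```

    Only $right holds the four euc-jp bytes for the original two characters.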

    The stuff coming back from the database is a string of "octets" (which happen to constitute valid utf8 data in Japanese), and you need to convert from utf8 octets to euc-jp octets, using the "from_to" function from Encode (I'll tweak the variable names for clarity):

    # prepare original data for the database:
    $utf8_string = decode( "euc-jp", $euc_string );
    # store that to the database, then some time later,
    # fetch it back, and convert back to euc-jp:
    from_to( $db_octets, "utf8", "euc-jp" );
    # now, $db_octets should be readable in an euc-jp display.

    Another way to do it, depending on whether you want to do other character-based things with the stuff that comes back from the database, is to go ahead and tell perl that the database string is utf8 (by "converting" it from utf8 octets to utf8 characters, if that makes sense), and then just convert it to euc-jp when you print it -- e.g. to STDOUT:

    $utf8_string = decode( "utf8", $db_octets );
    binmode STDOUT, ":encoding(euc-jp)";
    print $utf8_string;
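
    To check the output-layer approach without a terminal, here is a sketch that prints to an in-memory filehandle instead of STDOUT. The sample octets (日本 in utf8) and the names $fh and $output are assumptions for illustration.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # utf8 octets for 日本, standing in for a value fetched from the database
    my $db_octets   = "\xE6\x97\xA5\xE6\x9C\xAC";
    my $utf8_string = decode("UTF-8", $db_octets);

    # In-memory filehandle in place of STDOUT, so the written bytes can be inspected
    my $output = '';
    open(my $fh, '>', \$output) or die "cannot open in-memory handle: $!";
    binmode($fh, ":encoding(euc-jp)");

    # The encoding layer converts characters to euc-jp bytes on the way out
    print {$fh} $utf8_string;
    close($fh) or die "close failed: $!";

    printf "wrote %d euc-jp bytes\n", length($output);
    ```

    With the layer in place, the rest of the program can keep working on character strings and never call encode directly.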