Hi, the first thing to do is to figure out in which encoding the Japanese characters are being returned. Likely candidates are UTF-8, UCS-2 or CP932. There are several ways to find out:
1) - theoretical approach
Read all the docs and merge what they tell you... Not recommended :)
2) - trial and error
Try to decode the string ($s below, i.e. $row[1] in your case) using
$utf8 = Encode::decode('assumed-encoding-of-s', $s)
until you end up with a valid string in $utf8 (decode() returns a Perl character string, which is stored as UTF-8 internally). As you probably don't know yet how to tell whether the result is valid, I guess the next approach is better suited, though.
3) - empirical analysis
Print the byte representation of the string in hex
print unpack("H*", $s);
and look up what you get in one of the encoding tables that you can find via Google.
Just as an example, the following code
    use Encode "encode";

    my $a = "\x{3042}";   # hiragana 'a' == codepoint U+3042

    my $a_enc = {
        # common unicode encodings
        # (utf8 is encoded explicitly: unpack on the raw character
        #  string does not reliably yield the UTF-8 bytes)
        utf8   => encode("utf8", $a),
        ucs2be => encode("ucs2be", $a),
        ucs2le => encode("ucs2le", $a),
        # common jp legacy encodings
        sjis   => encode("sjis", $a),
        cp932  => encode("cp932", $a),   # MS version of shift-jis
        eucjp  => encode("eucjp", $a),
        # ASCII not possible!
        ascii  => encode("ascii", $a),   # -> renders as '?' (3f)
    };

    for my $encoding (sort keys %$a_enc) {
        printf "%-6s : %s\n", $encoding, unpack("H*", $a_enc->{$encoding});
    }
prints out the hex representation of Hiragana 'a' in various encodings:
    ascii  : 3f
    cp932  : 82a0
    eucjp  : a4a2
    sjis   : 82a0
    ucs2be : 3042
    ucs2le : 4230
    utf8   : e38182
Generally, it's NOT possible to convert this character to ASCII, so there's no use in trying...
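If you want to automate the trial-and-error approach (2), something along the following lines might do. This is only a minimal sketch: guess_encodings() is a hypothetical helper of mine, the candidate list is just a guess at what is likely here, and the bytes in $s are a stand-in for whatever is actually in $row[1].

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try each candidate encoding and return a hash ref
# mapping every encoding that decodes without error to the resulting
# code points (as "U+XXXX" strings).
sub guess_encodings {
    my ($bytes, @candidates) = @_;
    my %ok;
    for my $enc (@candidates) {
        my $copy  = $bytes;   # decode() with a CHECK value may modify its argument
        my $chars = eval { decode($enc, $copy, Encode::FB_CROAK) };
        next unless defined $chars;   # decoding failed for this encoding
        $ok{$enc} = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
    }
    return \%ok;
}

my $s      = "\x82\xa0";   # stand-in for the bytes in $row[1] (assumption)
my $result = guess_encodings($s, qw(UTF-8 cp932 euc-jp UCS-2BE));
printf "%-8s : %s\n", $_, $result->{$_} for sort keys %$result;
```

Note that more than one candidate may decode without complaint (UCS-2BE accepts almost any even number of bytes), so success alone proves nothing. Look at the code points: for Japanese text you would expect them to fall mostly into the hiragana/katakana range (U+3040 to U+30FF) or the kanji blocks.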
In order to actually show the character "on the screen", you'd need some program that can handle unicode characters, e.g. some UTF-8 capable terminal emulator (BTW, is this Windows, Linux, OS-X, or what?).
The best way is probably to use your browser (most modern browsers, like Firefox, can display Unicode, presuming proper fonts are installed; if yours does, the next character should be Japanese: あ ). To do so, let your Perl program create HTML entity representations of the Unicode characters, and embed those into some HTML page. For the purpose at hand, the '&#xCODEPOINT-IN-HEX;' form is easiest to generate. As you might have figured from the above example, the 'ucs2be' representation is equal to the Unicode codepoint (at least for characters in the Basic Multilingual Plane, which covers nearly all Japanese text), so, presuming $ch holds a single decoded character (a Perl character string, e.g. the result of Encode::decode), you could do
$html_entity = '&#x'.unpack("H*", encode("ucs2be", $ch)).';';
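For a whole string rather than a single character, the same idea can be sketched with ord() and a substitution. This is an illustration only: to_html_entities() is a hypothetical helper name, and it assumes its argument is an already decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character of a decoded
# (character) string with a hexadecimal numeric character reference.
# ASCII characters, including & < >, are passed through unchanged.
sub to_html_entities {
    my ($chars) = @_;
    (my $html = $chars) =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $html;
}

print to_html_entities("a\x{3042}b"), "\n";   # prints a&#x3042;b
```

Using ord() sidesteps the detour via 'ucs2be' and also works for code points above U+FFFF.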
Alternatively, if you declare the HTML page's encoding as content="text/html; charset=utf-8" you can pass the string through as it is (first make sure it is in UTF-8, of course). Also make sure the corresponding filehandle is opened as UTF-8, e.g. with binmode(STDOUT, ':encoding(UTF-8)').
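Put together, the pass-through variant might look roughly like this. Again just a sketch under assumptions: html_page() is a made-up helper, and the input bytes are assumed to be cp932, which you would have to verify first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: wrap a decoded character string in a minimal
# HTML page that declares its encoding as UTF-8.
sub html_page {
    my ($chars) = @_;
    return <<"HTML";
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body><p>$chars</p></body>
</html>
HTML
}

my $bytes = "\x82\xa0";               # stand-in for $row[1]; cp932 is an assumption
my $chars = decode('cp932', $bytes);  # bytes -> Perl character string

binmode(STDOUT, ':encoding(UTF-8)');  # characters -> UTF-8 bytes on output
print html_page($chars);
```

The decode() step turns the raw bytes into characters, and the :encoding(UTF-8) layer on the filehandle turns them back into UTF-8 bytes, so the bytes written match what the page declares.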
Cheers,
Almut
BTW, get rid of that $i++ in your code :) -- it is useless at best. Actually, it's responsible for that weird 1 in your "My ustring is now 1" (bonus points if you figure out why). The other weird 1s (at the end of "ascii1") are due to getcode() returning _two_ values in list context: the encoding, and the number of characters...
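To see the list-context effect in isolation, here is a small sketch. fake_getcode() is a hypothetical stand-in, not the real getcode(); that the real function returns only the encoding name in scalar context is my assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for getcode(): two values in list context,
# only the encoding name in scalar context (assumed behaviour).
sub fake_getcode {
    return wantarray ? ('ascii', 1) : 'ascii';
}

# print imposes list context on its arguments, so both values are output
print fake_getcode(), "\n";            # prints ascii1

# forcing scalar context yields just the encoding name
print scalar(fake_getcode()), "\n";    # prints ascii

# alternatively, pick up both values explicitly
my ($code, $nmatch) = fake_getcode();
print "$code\n";                       # prints ascii
```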
In reply to "Re: MS Access Input -> Japanese Output" by almut, in thread "MS Access Input -> Japanese Output" by Zettai