
Hi, the first thing to do is to figure out in which encoding the Japanese characters are being returned. Likely candidates are UTF-8, UCS-2 or CP932. There are several ways to find out:

1) - theoretical approach

Read all the docs and merge what they tell you... Not recommended :)

2) - trial and error

Try to convert the string ($row[1] in your case) using

$utf8 = Encode::decode('assumed-encoding-of-s', $s);

until you end up with the expected Japanese characters in $utf8 (decode() returns a Perl character string, which you can then write out as UTF-8). As you probably don't know yet how to tell whether that worked, I guess the next approach is better suited, though
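For the trial-and-error route, here is a minimal sketch that tries a few candidate encodings and keeps those that decode without errors. The helper name plausible_encodings and the sample bytes are made up for illustration. Note that a clean decode is only a hint, not proof: legacy encodings accept many byte sequences, so you still have to check that the resulting characters make sense.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the candidate encodings under which
# $bytes decodes without errors.
sub plausible_encodings {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $enc (@candidates) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps $bytes unmodified between attempts.
        my $chars = eval { decode($enc, $bytes, FB_CROAK | LEAVE_SRC) };
        push @ok, $enc if defined $chars;
    }
    return @ok;
}

# Sample data (an assumption): hiragana 'a' as UTF-8 bytes.
my @ok = plausible_encodings("\xe3\x81\x82", qw(UTF-8 cp932 euc-jp));
print "decodes cleanly as: @ok\n";
```

In your case you would pass $row[1] instead of the sample bytes.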

3) - empirical analysis

print the byte representation of the string in hex

print unpack("H*", $s);

and look up what you get in one of the encoding tables that you can find via Google.

Just as an example, the following code

use Encode "encode";

my $a = "\x{3042}";  # hiragana 'a' == codepoint U+3042
my $a_enc = {
    # common unicode encodings
    utf8   => encode("utf8",   $a),  # encode explicitly: $a is a character string, not bytes
    ucs2be => encode("ucs2be", $a),
    ucs2le => encode("ucs2le", $a),

    # common jp legacy encodings
    sjis   => encode("sjis",   $a),
    cp932  => encode("cp932",  $a),  # MS version of shift-jis
    eucjp  => encode("eucjp",  $a),
    
    # ASCII not possible!
    ascii  => encode("ascii",  $a),  # -> renders as '?' (3f)
};

for my $encoding (sort keys %$a_enc) {
    printf "%-6s : %s\n", $encoding,
                          unpack("H*", $a_enc->{$encoding});
}

prints out the hex representation of Hiragana 'a' in various encodings:

ascii  : 3f
cp932  : 82a0
eucjp  : a4a2
sjis   : 82a0
ucs2be : 3042
ucs2le : 4230
utf8   : e38182

Generally, it's NOT possible to convert this character to ASCII, so there's no use in trying...

In order to actually show the character "on the screen", you'd need some program that can handle Unicode characters, e.g. a UTF-8 capable terminal emulator (BTW, is this Windows, Linux, OS X, or what?).

The best way is probably to use your browser (most modern browsers, such as Firefox, can display Unicode, provided suitable fonts are installed; if yours does, the next character should be Japanese: あ ). To do so, let your Perl program create HTML entity representations of the Unicode characters, and embed those into an HTML page. For the purpose at hand, the '&#xCODEPOINT-IN-HEX;' form is easiest to generate. As you might have figured from the above example, the 'ucs2be' representation of a character equals its Unicode codepoint (this holds for everything in the Basic Multilingual Plane, which covers the usual Japanese characters). So, presuming $ch holds a single decoded character (i.e. the result of Encode::decode(), not raw UTF-8 bytes), you could do

$html_entity = '&#x'.unpack("H*", encode("ucs2be", $ch)).';';
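The same idea extends to a whole string. The following is a rough sketch, assuming the input has already been decoded into a Perl character string; it uses ord() to get the codepoint directly instead of going through 'ucs2be'. The helper name to_html_entities is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turns every non-ASCII character of a decoded
# string into a hexadecimal HTML entity; ASCII is passed through
# unchanged (so '&', '<' and '>' are NOT escaped here).
sub to_html_entities {
    my ($chars) = @_;
    (my $html = $chars) =~ s/([^\x00-\x7f])/sprintf('&#x%X;', ord($1))/ge;
    return $html;
}

print to_html_entities("a \x{3042} b"), "\n";   # prints: a &#x3042; b
```

Since the output is plain ASCII, it does not matter which charset the HTML page declares.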

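Alternatively, if you declare the HTML page's encoding as content="text/html; charset=utf-8" you can pass the string through as it is (first make sure it is in UTF-8, of course). If you print decoded character strings, also make sure the corresponding filehandle is opened with a UTF-8 layer (':utf8' or ':encoding(UTF-8)'); if the string already consists of raw UTF-8 bytes, print it without such a layer, otherwise it gets encoded twice.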
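A minimal sketch of that second route, assuming the text is a decoded character string. The helper name write_utf8_page and the file name test.html are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: writes a decoded character string into a
# minimal HTML page that declares itself as UTF-8.
# $text is inserted as is, without escaping '&', '<' or '>'.
sub write_utf8_page {
    my ($path, $text) = @_;
    # The :encoding(UTF-8) layer encodes characters to bytes on output.
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print {$fh} <<"HTML";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body><p>$text</p></body>
</html>
HTML
    close $fh or die "Cannot close $path: $!";
    return;
}

write_utf8_page('test.html', "\x{3042}");   # file name is an assumption
```

Open the resulting file in your browser; if everything is set up right, it should show the same hiragana character as above.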

Cheers,
Almut

BTW, get rid of that $i++ in your code :) -- it is useless at best. Actually, it's responsible for that weird 1 in your "My ustring is now 1" (bonus points if you figure out why). The other weird 1s (at the end of "ascii1") are due to getcode() returning _two_ values in list context: the encoding, and the number of characters...

In reply to Re: MS Access Input -> Japanese Output by almut
in thread MS Access Input -> Japanese Output by Zettai
