in reply to MS Access Input -> Japanese Output

Hi, first thing to do is to figure out in what encoding the japanese characters are being returned. Likely candidates are UTF-8, UCS-2 or CP932. There are several ways to find out:

1) - theoretical approach

Read all the docs and merge what they tell you... Not recommended :)

2) - trial and error

Try to convert the string ($row[1] in your case) using

$utf8 = Encode::decode('assumed-encoding-of-s', $s)

until you end up with a valid UTF-8 string in $utf8. As you probably don't know yet how to tell the latter, I guess the next approach is better suited, though

3) - empirical analysis

print the byte representation of the string in hex

print unpack("H*", $s);

and look up what you get in one of the encoding tables that you can find via Google.

Just as an example, the following code

use Encode "encode"; my $a = "\x{3042}"; # hiragana 'a' == codepoint U+3042 my $a_enc = { # common unicode encodings utf8 => $a, ucs2be => encode("ucs2be", $a), ucs2le => encode("ucs2le", $a), # common jp legacy encodings sjis => encode("sjis", $a), cp932 => encode("cp932", $a), # MS version of shift-jis eucjp => encode("eucjp", $a), # ASCII not possible! ascii => encode("ascii", $a), # -> renders as '?' (3f) }; for my $encoding (sort keys %$a_enc) { printf "%-6s : %s\n", $encoding, unpack("H*", $a_enc->{$encoding}); }

prints out the hex representation of Hiragana 'a' in various encodings:

ascii : 3f cp932 : 82a0 eucjp : a4a2 sjis : 82a0 ucs2be : 3042 ucs2le : 4230 utf8 : e38182

Generally, it's NOT possible to convert this character to ASCII, so there's no use in trying...

In order to actually show the character "on the screen", you'd need some program that can handle unicode characters, e.g. some UTF-8 capable terminal emulator (BTW, is this Windows, Linux, OS-X, or what?).

Best way is probably to use your browser (most modern browsers - like Firefox - can display unicode, presuming proper fonts are installed -- if it does, the next character should be japanese: あ ). To do so, let your perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the '&#xCODEPOINT-IN-HEX;' form is easiest to generate. As you might have figured from the above example, the 'ucs2be' representation is equal to the unicode codepoint, so, presuming the character $ch is in UTF-8, you could do

$html_entity = '&#x'.unpack("H*", encode("ucs2be", $ch)).';';

Alternatively, if you declare the HTML page's encoding as content="text/html; charset=utf8" you can pass through the string as it is (first make sure it is in UTF-8, of course). Also make sure the corresponding filehandle is opened as utf8.

Cheers,
Almut

BTW, get rid of that $i++ in your code :) -- it is useless at best. Actually, it's responsible for that weird 1 in your "My ustring is now 1" (bonus points if you figure out why). The other weird 1s (at the end of "ascii1") are due to getcode() returning _two_ values in list context: the encoding, and the number of chararcters...

Replies are listed 'Best First'.
Re^2: MS Access Input -> Japanese Output
by Zettai (Acolyte) on Nov 23, 2006 at 12:21 UTC
    Hi Almut,

    Thank you very much for your reply. I decided to take the obscurity of the terminal window out of the picture and am now trying to generate it as a html page to view in Firefox.

    This is all being done on Windows XP Professional.

    At the moment I can do this with output to Firefox:

    1) Using ASP and MS Access - OK!
    2) Using ASP and MySQL - question marks.
    3) Using Perl and MS Access - question marks.
    4) Using Perl and MySQL - question marks.

    So naturally I would like to do this with Perl and MySQL. When looking at the data in MySQL with a GUI like NaviCat I can see all the Japanese characters no worries. As part of the MySQL my.ini file i have amongst others, the following settings:

    # The default character set that will be used when a new schema or table is created and no character set is defined
    default-character-set=utf8

    As part of the client section I have:

    mysql

    default-character-set=utf8

    Also using something like 'MySQL Query Browser' I can again see the Japanese characters when I perform a simple query such as "SELECT kana, kanji FROM vocab".

    I must admit a lot of what you said went straight over my head and I didn't really understand it.

    I am sorry to say but I have no idea how to do this:
    "To do so, let your perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the ' ODEPOINT-IN-HEX;' form is easiest to generate."

    Here is the code I am currently using:

    use DBI; use CGI qw/:standard :html3/; use CGI::Carp qw(fatalsToBrowser); my $query = CGI -> new; print header(-charset=>'utf-8'); my $dbh = DBI->connect('DBI:ODBC:japmysql','','') or die "Cannot conne +ct: $DBI::errstr\n"; my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab') or die("Couldn't prepare"); $sth->execute or die ("Could not execute SQL statement."); if($sth->rows == 0) { print "No information."; } else { print "<table border=1>\n"; while( $resptr = $sth->fetchrow_hashref() ) { print "<tr>"; print "<td>" . $resptr->{"English"}; print "<td>" . $resptr->{"Kana"}; print "<td>" . $resptr->{"Kanji"}; print "\n"; } print "</table>\n"; } print "</body></html>\n"; $dbh->disconnect;

    Can you please show me what you mean using very short words? :)

    Thanks again for all your help!

      Presuming the data you get from mysql actually is in UTF-8, you could try two alternatives:

      (1) - The straightforward approach: add a line binmode STDOUT, ":utf8"; (e.g. before the line print header(-charset=>'utf-8');

      This might already suffice, as it's supposed to make sure that the CGI output is being written as UTF-8 (theoretically, there could still be a few other pitfalls, though...)

      (2) - The safe approach: add a little conversion function

      use Encode "encode"; sub U2Entity { return '&#x'.unpack("H*", encode("ucs2be", shift)).';'; }

      and then modify the lines where you output the non-english table cells to read: print "<td>" . U2Entity( $resptr->{"Kana"} ); , etc.

      In this case you no longer need to declare -charset=>'utf-8', as now everything is output in plain ASCII -- for example the Hiragana 'a' would be converted to its HTML entity representation &#x3042;

      If both of these approaches do not work (which might well be the case), then the data from mysql most likely isn't in UTF-8. To figure out what you're actually being passed from the mysql side, try the following:

      Add another little function

      sub Any2Hex { return unpack("H*", shift); }

      and then use that one instead of U2Entity(), i.e. print "<td>" . Any2Hex( $resptr->{"Kana"} );

      and report back which hex values you're getting... :)

      (In case you're merely seeing 3f (the hex value of the ASCII code representing the question mark character), then the problem is most likely the mysql setup... --> How did you create your mysql tables, etc...?)

      Cheers,
      Almut

      BTW, just as an aside: to create a proper, complete HTML file, you'll probably also want to output the <html> and <body> tags...

        Hi Almut,

        Thank you for all your help. I really do appreciate it. You were right. I am receiving ascii from the MySQL side. The output barring a couple of 4e41's is mainly 3f3f3f3f's. :(

        A really puzzling thing though is that while I was waiting for your response I managed to connect to the MySQL database with Java/JSP and it outputs to the webpage all of the Japanese.

        But as far as setting up the database in MySQL during the installation of the MySQL server 5.0 (v5.0.27) I chose the following parameters:

        1. Developer Machine
        2. Multifunctional Database
        3. InnoDB Tablespace Settings - I just chose where to store the data files.
        4. Decision Support(DSS)/OLAP - low number of connections cause at this stage it's just me mucking around with it.
        5. Enabled TCP/IP Networking and Strict Mode
        6. Best Support for Multilingualism - Make UTF8 the default character set (it even has Nihongo in Kanji which is a nice touch to this option)
        7. Install as a windows service and Include BIN directory in Windows PATH - both ticked.
        8. Set the root user password....etc....

        So the 'default character set' option is probably the most relevant.

        I then used Navicat which is a client that can connect to a MySQL database server; to Import all of the tables from MS Access into MySQL. First I did this into the 'test' database that is automatically generated by MySQL during it's installation but doesn't have any tables in it.

        When I opened an imported table I was overjoyed to see Japanese characters and not question marks. I then checked the tables in Navicat's design view and found the following (relevant) parameters:
        Character set: utf8
        Collation: utf8_general_ci

        So it would appear that the data is being stored in the MySQL database as utf8 based on the settings appearing in Navicat but Perl says it is ascii based on all of those hex codes returned and the display of question marks.

        The thing I don't get is that Java was able to do it on the JSP page. This makes me think that maybe it is an IIS thing. As I previously mentioned ASP works ok with a MS Access file but shows the same behaviour (i.e. the question marks) as Perl when used with the MySQL database.

        Perhaps the JSP works only because it is actually going through the embedded Tomcat webserver in the NetBeans IDE.

        What do you reckon? Looks like IIS eh? God I am so sorry!!! I was sure you would not reply last time. Thrown in the idiot user basket - "Are you sure your computer's plugged in..." scenario...

        But thank you so much for your patience and guidance. I will look further into the IIS angle. A quick google looks like this may well be the case.

        My sincerest apologies for the time you may have wasted on this but I really do appreciate you sticking with it for this long so we could at least narrow it down to a possible IIS problem.

        Thank you!!!