Zettai has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have searched through the forums and this is very likely a duplicate but I really can't figure this out. I am a newbie at programming and Perl and it is probably beyond me. At the moment I am trying to take some Japanese vocabulary from a MS Access file and then first print it as output to the screen but eventually I want to put it on a website. The first row in my database is: English: 'Ah!' Kana: aa (in hiragana) Kanji: NA (because there is no kanji for this expression) So far I have this:
use DBI;
use Encode;
use Jcode;

my $dbh = DBI->connect('DBI:ODBC:japan','','')
    or die "Cannot connect: $DBI::errstr\n";
my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
    or die "Cannot prepare: $DBI::errstr\n";
$sth->execute or die "Cannot execute: $DBI::errstr\n";

@row = $sth->fetchrow_array();
print("@row\n");

$i=0;
foreach $i (@row) {
    print(getcode($i), "\n");
    $i++;
}

my $ustring = decode( 'ascii1', $row[2] );
print("My ustring is now $ustring\n");

#$dbh->disconnect;
Output1:
----------
Ah! ? NA
ascii1
ascii1
ascii1
Unknown encoding 'ascii1' at ... line 22.
So I get a question mark when I try to print the hiragana and kanji (for subsequent rows too). I used Jcode hoping to see in what form the Perl program receives the information, but it feels like I am doing this part wrong. Don't I need to check the data before it enters the Perl structure? Even if I did this part right, the reported encoding ('ascii1') looks strange. If I change the decode call to just 'ascii', then my output is this:
Output2:
---------
Ah! ? NA
ascii1
ascii1
ascii1
My ustring is now 1.
I have no idea how this all works. Can someone please help me use Jcode correctly so I can see how the program receives the data from MS Access and also how to then print out the hiragana or kanji correctly to the screen? Thanks very much for your time.

Replies are listed 'Best First'.
Re: MS Access Input -> Japanese Output
by almut (Canon) on Nov 13, 2006 at 04:14 UTC

    Hi, first thing to do is to figure out in what encoding the Japanese characters are being returned. Likely candidates are UTF-8, UCS-2 or CP932. There are several ways to find out:

    1) - theoretical approach

    Read all the docs and merge what they tell you... Not recommended :)

    2) - trial and error

    Try to convert the string ($row[1] in your case) using

    $utf8 = Encode::decode('assumed-encoding-of-s', $s)

    until you end up with a valid UTF-8 string in $utf8. As you probably don't know yet how to tell whether the result really is valid, I guess the next approach is better suited, though.
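
    If you want to automate the guessing a little, here's a rough sketch of such a loop. The candidate list is just a guess, and a clean decode doesn't prove the guess is right -- it only rules out encodings that don't fit (note how the stand-in bytes below decode "cleanly" under more than one encoding):

    use Encode;

    # in your case $s would be $row[1]; stand-in bytes here
    # (hiragana 'a' in cp932) so the snippet runs on its own
    my $s = "\x82\xa0";

    for my $try (qw(utf8 UCS-2LE UCS-2BE cp932 shiftjis euc-jp)) {
        my $bytes = $s;   # work on a copy; decode() may modify its argument when a CHECK is given
        # FB_CROAK makes decode() die on bytes that are invalid in $try
        my $decoded = eval { Encode::decode($try, $bytes, Encode::FB_CROAK) };
        print defined $decoded ? "$try: decodes cleanly\n" : "$try: not $try\n";
    }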

    3) - empirical analysis

    print the byte representation of the string in hex

    print unpack("H*", $s);

    and look up what you get in one of the encoding tables that you can find via Google.

    Just as an example, the following code

    use Encode "encode"; my $a = "\x{3042}"; # hiragana 'a' == codepoint U+3042 my $a_enc = { # common unicode encodings utf8 => $a, ucs2be => encode("ucs2be", $a), ucs2le => encode("ucs2le", $a), # common jp legacy encodings sjis => encode("sjis", $a), cp932 => encode("cp932", $a), # MS version of shift-jis eucjp => encode("eucjp", $a), # ASCII not possible! ascii => encode("ascii", $a), # -> renders as '?' (3f) }; for my $encoding (sort keys %$a_enc) { printf "%-6s : %s\n", $encoding, unpack("H*", $a_enc->{$encoding}); }

    prints out the hex representation of Hiragana 'a' in various encodings:

    ascii  : 3f
    cp932  : 82a0
    eucjp  : a4a2
    sjis   : 82a0
    ucs2be : 3042
    ucs2le : 4230
    utf8   : e38182

    Generally, it's NOT possible to convert this character to ASCII, so there's no use in trying...

    In order to actually show the character "on the screen", you'd need some program that can handle unicode characters, e.g. some UTF-8 capable terminal emulator (BTW, is this Windows, Linux, OS-X, or what?).

    Best way is probably to use your browser (most modern browsers - like Firefox - can display unicode, presuming proper fonts are installed -- if yours does, the next character should be Japanese: あ ). To do so, let your Perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the '&#xCODEPOINT-IN-HEX;' form is easiest to generate. As you might have figured from the above example, the 'ucs2be' representation is equal to the unicode codepoint, so, presuming the character $ch is in UTF-8, you could do

    $html_entity = '&#x'.unpack("H*", encode("ucs2be", $ch)).';';

    Alternatively, if you declare the HTML page's encoding as content="text/html; charset=utf8" you can pass through the string as it is (first make sure it is in UTF-8, of course). Also make sure the corresponding filehandle is opened as utf8.
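
    A minimal sketch of that pass-through variant (the file name and the assumption that $kana already holds a decoded Perl string are just for illustration):

    # assume $kana is already a decoded Perl string (e.g. from Encode::decode)
    my $kana = "\x{3042}";   # hiragana 'a', as a stand-in

    # open the output with the :utf8 layer and declare the charset in the page
    open my $fh, '>:utf8', 'vocab.html' or die "Cannot open vocab.html: $!";
    print $fh qq{<html>\n<head>\n},
              qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n},
              qq{</head>\n<body>\n<p>$kana</p>\n</body>\n</html>\n};
    close $fh;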

    Cheers,
    Almut

    BTW, get rid of that $i++ in your code :) -- it is useless at best. Actually, it's responsible for that weird 1 in your "My ustring is now 1" (bonus points if you figure out why). The other weird 1s (at the end of "ascii1") are due to getcode() returning _two_ values in list context: the encoding, and the number of characters...
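
    In other words (purely illustrative):

    use Jcode;

    my $str = "just some text";            # any string will do for the demo
    my ($code, $nmatch) = getcode($str);   # list context: encoding AND match count
    print "encoding: $code\n";

    # whereas  print(getcode($str), "\n")  prints both values back to back,
    # which is where the stray '1' after 'ascii' came from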

      Hi Almut,

      Thank you very much for your reply. I decided to take the obscurity of the terminal window out of the picture and am now trying to generate it as a html page to view in Firefox.

      This is all being done on Windows XP Professional.

      At the moment I can do this with output to Firefox:

      1) Using ASP and MS Access - OK!
      2) Using ASP and MySQL - question marks.
      3) Using Perl and MS Access - question marks.
      4) Using Perl and MySQL - question marks.

      So naturally I would like to do this with Perl and MySQL. When looking at the data in MySQL with a GUI like NaviCat I can see all the Japanese characters, no worries. In the MySQL my.ini file I have, among others, the following settings:

      # The default character set that will be used when a new schema or table is created and no character set is defined
      default-character-set=utf8

      As part of the client section I have:

      [mysql]

      default-character-set=utf8

      Also using something like 'MySQL Query Browser' I can again see the Japanese characters when I perform a simple query such as "SELECT kana, kanji FROM vocab".

      I must admit a lot of what you said went straight over my head and I didn't really understand it.

      I am sorry to say but I have no idea how to do this:
      "To do so, let your perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the ' ODEPOINT-IN-HEX;' form is easiest to generate."

      Here is the code I am currently using:

      use DBI;
      use CGI qw/:standard :html3/;
      use CGI::Carp qw(fatalsToBrowser);

      my $query = CGI->new;

      print header(-charset=>'utf-8');

      my $dbh = DBI->connect('DBI:ODBC:japmysql','','')
          or die "Cannot connect: $DBI::errstr\n";
      my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
          or die("Couldn't prepare");
      $sth->execute or die ("Could not execute SQL statement.");

      if($sth->rows == 0) {
          print "No information.";
      }
      else {
          print "<table border=1>\n";
          while( $resptr = $sth->fetchrow_hashref() ) {
              print "<tr>";
              print "<td>" . $resptr->{"English"};
              print "<td>" . $resptr->{"Kana"};
              print "<td>" . $resptr->{"Kanji"};
              print "\n";
          }
          print "</table>\n";
      }
      print "</body></html>\n";

      $dbh->disconnect;

      Can you please show me what you mean using very short words? :)

      Thanks again for all your help!

        Presuming the data you get from mysql actually is in UTF-8, you could try two alternatives:

        (1) - The straightforward approach: add the line binmode STDOUT, ":utf8"; (e.g. before the line print header(-charset=>'utf-8');)

        This might already suffice, as it's supposed to make sure that the CGI output is being written as UTF-8 (theoretically, there could still be a few other pitfalls, though...)
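
        In context, the top of your script would then start out roughly like this (everything else unchanged):

        use DBI;
        use CGI qw/:standard :html3/;
        use CGI::Carp qw(fatalsToBrowser);

        my $query = CGI->new;

        binmode STDOUT, ":utf8";            # write the output as UTF-8 bytes...
        print header(-charset=>'utf-8');    # ...and tell the browser that's what it gets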

        (2) - The safe approach: add a little conversion function

        use Encode "encode"; sub U2Entity { return '&#x'.unpack("H*", encode("ucs2be", shift)).';'; }

        and then modify the lines where you output the non-english table cells to read: print "<td>" . U2Entity( $resptr->{"Kana"} ); , etc.

        In this case you no longer need to declare -charset=>'utf-8', as now everything is output in plain ASCII -- for example the Hiragana 'a' would be converted to its HTML entity representation &#x3042;

        If both of these approaches do not work (which might well be the case), then the data from mysql most likely isn't in UTF-8. To figure out what you're actually being passed from the mysql side, try the following:

        Add another little function

        sub Any2Hex { return unpack("H*", shift); }

        and then use that one instead of U2Entity(), i.e. print "<td>" . Any2Hex( $resptr->{"Kana"} );

        and report back which hex values you're getting... :)

        (In case you're merely seeing 3f (the hex value of the ASCII code representing the question mark character), then the problem is most likely the mysql setup... --> How did you create your mysql tables, etc...?)
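
        (For that, something along these lines -- untested via ODBC, and reusing your 'japmysql' DSN -- would show what the server thinks it is using:)

        use DBI;

        my $dbh = DBI->connect('DBI:ODBC:japmysql','','')
            or die "Cannot connect: $DBI::errstr\n";

        # the server/client/connection character set settings
        my $vars = $dbh->selectall_arrayref("SHOW VARIABLES LIKE 'character_set%'");
        printf "%-30s %s\n", @$_ for @$vars;

        # and the declaration of the table itself
        my ($name, $ddl) = $dbh->selectrow_array('SHOW CREATE TABLE Vocab');
        print "$ddl\n";

        $dbh->disconnect;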

        Cheers,
        Almut

        BTW, just as an aside: to create a proper, complete HTML file, you'll probably also want to output the <html> and <body> tags...

Re: MS Access Input -> Japanese Output
by graff (Chancellor) on Nov 13, 2006 at 02:44 UTC
    I am trying to take some Japanese vocabulary from a MS Access file and then first print it as output to the screen but eventually I want to put it on a website.

    You need to know which character encoding is being used for Japanese in the MS Access file. (I wouldn't really know; cp932 seems likely, or shiftjis may also work, but you should try to confirm that somehow. Check out Encode::Guess.)
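
    For example, something like this (the suspect list is just a guess; ascii, utf8 and UTF-16/32 with a BOM are checked by default):

    use Encode::Guess;

    # $bytes would be the raw value fetched from Access; stand-in bytes here
    my $bytes = "\x82\xa0";   # hiragana 'a' in cp932, for illustration

    my $enc = guess_encoding($bytes, qw/cp932 euc-jp/);
    if ( ref $enc ) {
        print "looks like ", $enc->name, "\n";
    }
    else {
        print "could not guess: $enc\n";   # on failure $enc is an error string
    }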

    And what sort of "screen" are you talking about? Is it an app that has the appropriate fonts and can correctly display the Japanese text data from Access? If so, it is presumably using the same character encoding that Access is using, and maybe you just want to preserve that encoding, even when putting the data onto a web page.

    Preserving the existing encoding is easy enough -- just don't do anything but fetch the data and pass it along as-is. If you have reasons for converting it to unicode, utf8 is the best encoding for that (it's what perl uses internally, so you start with conversion to utf8 anyway). Note that you need a utf8-capable display to view such data. (It sounds like you have such a display tool already, since you mentioned seeing "question marks" where you expected Hiragana and Kanji -- that's what you get when a utf8-based display is given non-utf8 data.)

    You would want to convert to utf8 if you intend to do regex matching, and/or substitutions, and/or any sort of character-based (rather than byte-based) manipulation on strings. Doing this sort of thing on non-unicode Japanese text is a risky business at best -- it's possible (and not that hard) to corrupt the data beyond recognition or repair.

    "ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)

    Anyway, let's assume that the Access database has stuff in cp932. Here's how you'd adjust the OP code to output the data as utf8:

    use DBI;
    use Encode;

    binmode STDOUT, ":utf8";   # this will avoid warnings on output

    my $dbh = DBI->connect('DBI:ODBC:japan','','')
        or die "Cannot connect: $DBI::errstr\n";
    my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
        or die "Cannot prepare: $DBI::errstr\n";
    $sth->execute or die "Cannot execute: $DBI::errstr\n";

    my $rownum = 1;
    while( my ($eng,$kana,$kanji) = $sth->fetchrow_array() ) {
        # $eng is presumably ASCII already -- no conversion needed
        $_ = decode( 'cp932', $_ ) for ( $kana, $kanji );
        printf( "%d:\t%s\t%s\t%s\n", $rownum++, $eng, $kana, $kanji );
    }
    $dbh->disconnect;
    (not tested, but should be close to what you need)
      Thanks for your help. But this doesn't work. At this stage I am outputting to the debug output terminal within Komodo. I still get question marks for hiragana and kanji.
      "ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)
      I got this as output from my program.
      foreach $i (@row) { print(getcode($i), "\n"); $i++; }
      But at this point it feels like I am testing the data encoding too late. It has already been parsed (?) by Perl and put into a Perl array.

      Somewhere between:

      $sth->execute or die "Cannot execute: $DBI::errstr\n";

      and

      @row = $sth->fetchrow_array();
      is where I should be testing the encoding of my data, shouldn't I? I don't know how to do this part of the program.

      There is a great article at:
      http://ahinea.com/en/tech/perl-unicode-struggle.html

      But again I was unable to adjust the information in it to suit my needs. The terminal can output the hiragana/kanji if it is already in UTF-8 encoding but I just can't get it into that encoding after I take it from MS Access.

      Please a little more help.

        I got this as output from my program.
        foreach $i (@row) { print(getcode($i), "\n"); $i++; }

        Ah. Sorry, I should have pointed out earlier that there is a problem with that loop. You need to study Perl syntax a little more...

        When you say for $i ( @row ) (or "foreach"), $i is being set to each successive value of @row on each iteration -- in other words, $i is not an array index, it is the value stored at each element of the array. So do not increment $i in that sort of situation, because it makes no sense to do that. (That's probably where the "1" is coming from.)
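
        To make that concrete, here are two ways to write that loop that do make sense (with stand-in data so the snippet runs on its own):

        use Jcode;

        my @row = ( 'Ah!', 'kana bytes here', 'kanji bytes here' );  # stand-in data

        # loop over the values -- $value already IS each element, no counter needed
        for my $value (@row) {
            my ($code) = getcode($value);
            print "$code\n";
        }

        # or, if you really want an index, loop over the indices instead
        for my $i ( 0 .. $#row ) {
            my ($code) = getcode( $row[$i] );
            print "field $i: $code\n";
        }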

        So on the first iteration through that loop, you are looking at the English field, which is presumably ascii data. You still need to figure out what encoding is being used in the latter two fields (Kana and Kanji). I gather that the "getcode" method in Jcode is supposed to return the encoding -- here's what the documentation says:

               ($code, $nmatch) = getcode($str)
                 Returns char code of $str. Return codes are as follows
        
                  ascii   Ascii (Contains no Japanese Code)
                  binary  Binary (Not Text File)
                  euc     EUC-JP
                  sjis    SHIFT_JIS
                  jis     JIS (ISO-2022-JP)
                  ucs2    UCS2 (Raw Unicode)
                  utf8    UTF8
        
        So this method should tell you what you need to know. I'll try again with a snippet suggestion:
        binmode STDOUT, ":utf8";

        # connect and run your query on Access db... then:
        my @row   = $sth->fetchrow_array;
        my $eng   = shift @row;   # first field is English
        my $kana  = shift @row;   # second field is Kana
        my $kanji = shift @row;   # third field is Kanji

        my $kana_enc  = getcode( $kana );
        my $kanji_enc = getcode( $kanji );
        if ( $kana_enc ne $kanji_enc ) {
            warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
        }

        my $kana_utf8  = decode( $kana_enc,  $kana );
        my $kanji_utf8 = decode( $kanji_enc, $kanji );
        printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

        You just said "this doesn't work"... You have to be more explicit. Show the actual code you used, including the modifications you made according to my suggestions (so I can see whether you actually did as I intended), and give some sort of definition for "doesn't work", in the sense of "I expected this: ... but got this: ..." -- that is, try to show some actual data.

        (Saving the output to a file and viewing that with any sort of tool that shows byte-by-byte hex codes can be very helpful. On unix/linux and unix-tools-ported-to-windows, there's the "od" command, and just running "od -txC data.file" would do nicely.)
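
        (If you don't have od on Windows, a few lines of Perl will do much the same job:)

        # rough stand-in for "od -tx1" -- dump a file's raw bytes as hex
        open my $fh, '<:raw', 'data.file' or die "Cannot open data.file: $!";
        local $/;                           # slurp the whole file at once
        print unpack("H*", <$fh>), "\n";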

        Please, a little more information about what you are dealing with, and what you've done with my earlier suggestion.

        UPDATE: I just noticed that the strings returned by Jcode::getcode() might not work when passed to Encode::decode. You may need to add a hash that maps the Jcode strings to valid Encode designations:

        my %code_map = (
            euc  => 'euc-jp',
            sjis => 'shiftjis',
            jis  => 'iso-2022-jp',
            ucs2 => 'UCS-2LE',
            utf8 => 'utf8',
        );
        # ...
        my $kana_enc = getcode( $kana );
        # ...
        $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
        # ...