Zettai has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have searched through the forums and this is very likely a duplicate but I really can't figure this out. I am a newbie at programming and Perl and it is probably beyond me. At the moment I am trying to take some Japanese vocabulary from a MS Access file and then first print it as output to the screen but eventually I want to put it on a website. The first row in my database is: English: 'Ah!' Kana: aa (in hiragana) Kanji: NA (because there is no kanji for this expression) So far I have this:
use DBI;
use Encode;
use Jcode;

my $dbh = DBI->connect('DBI:ODBC:japan','','')
    or die "Cannot connect: $DBI::errstr\n";
my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
    or die "Cannot prepare: $DBI::errstr\n";
$sth->execute or die "Cannot execute: $DBI::errstr\n";

@row = $sth->fetchrow_array();
print("@row\n");

$i=0;
foreach $i (@row) {
    print(getcode($i), "\n");
    $i++;
}

my $ustring = decode( 'ascii1', $row[2] );
print("My ustring is now $ustring\n");

#$dbh->disconnect;
Output1:
----------
Ah! ? NA
ascii1
ascii1
ascii1
Unknown encoding 'ascii1' at ... line 22.
So I get a question mark when I try to print the hiragana and kanji (for subsequent rows too). I used Jcode hoping to see in what form the Perl program receives the information, but it feels like I am doing this part wrong. Don't I need to check the data before it enters the Perl structure? Even if I did this part right, the reported encoding ('ascii1') looks strange. If I change the decode call to just 'ascii', then my output is this:
Output2:
---------
Ah! ? NA
ascii1
ascii1
ascii1
My ustring is now 1.
I have no idea how this all works. Can someone please help me use Jcode correctly so I can see how the program receives the data from MS Access and also how to then print out the hiragana or kanji correctly to the screen? Thanks very much for your time.

Replies are listed 'Best First'.
Re: MS Access Input -> Japanese Output
by almut (Canon) on Nov 13, 2006 at 04:14 UTC

    Hi, first thing to do is to figure out in what encoding the Japanese characters are being returned. Likely candidates are UTF-8, UCS-2 or CP932. There are several ways to find out:

    1) - theoretical approach

    Read all the docs and merge what they tell you... Not recommended :)

    2) - trial and error

    Try to convert the string ($row[1] in your case) using

    $utf8 = Encode::decode('assumed-encoding-of-s', $s)

    until you end up with a valid UTF-8 string in $utf8. As you probably don't know yet how to tell whether the result really is valid, I guess the next approach is better suited, though.
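
    If you want to automate the guessing a little, here's a rough sketch of such a loop. The candidate list is just a guess, and a clean decode doesn't prove the guess is right -- it only rules out encodings that don't fit (note how the stand-in bytes below decode "cleanly" under more than one encoding):

    use Encode;

    # in your case $s would be $row[1]; stand-in bytes here
    # (hiragana 'a' in cp932) so the snippet runs on its own
    my $s = "\x82\xa0";

    for my $try (qw(utf8 UCS-2LE UCS-2BE cp932 shiftjis euc-jp)) {
        my $bytes = $s;   # work on a copy; decode() may modify its argument when a CHECK is given
        # FB_CROAK makes decode() die on bytes that are invalid in $try
        my $decoded = eval { Encode::decode($try, $bytes, Encode::FB_CROAK) };
        print defined $decoded ? "$try: decodes cleanly\n" : "$try: not $try\n";
    }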

    3) - empirical analysis

    print the byte representation of the string in hex

    print unpack("H*", $s);

    and look up what you get in one of the encoding tables that you can find via Google.

    Just as an example, the following code

    use Encode "encode"; my $a = "\x{3042}"; # hiragana 'a' == codepoint U+3042 my $a_enc = { # common unicode encodings utf8 => $a, ucs2be => encode("ucs2be", $a), ucs2le => encode("ucs2le", $a), # common jp legacy encodings sjis => encode("sjis", $a), cp932 => encode("cp932", $a), # MS version of shift-jis eucjp => encode("eucjp", $a), # ASCII not possible! ascii => encode("ascii", $a), # -> renders as '?' (3f) }; for my $encoding (sort keys %$a_enc) { printf "%-6s : %s\n", $encoding, unpack("H*", $a_enc->{$encoding}); }

    prints out the hex representation of Hiragana 'a' in various encodings:

    ascii  : 3f
    cp932  : 82a0
    eucjp  : a4a2
    sjis   : 82a0
    ucs2be : 3042
    ucs2le : 4230
    utf8   : e38182

    Generally, it's NOT possible to convert this character to ASCII, so there's no use in trying...

    In order to actually show the character "on the screen", you'd need some program that can handle unicode characters, e.g. some UTF-8 capable terminal emulator (BTW, is this Windows, Linux, OS-X, or what?).

    Best way is probably to use your browser (most modern browsers - like Firefox - can display unicode, presuming proper fonts are installed -- if yours does, the next character should be Japanese: あ ). To do so, let your Perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the '&#xCODEPOINT-IN-HEX;' form is easiest to generate. As you might have figured from the above example, the 'ucs2be' representation is equal to the unicode codepoint, so, presuming the character $ch is in UTF-8, you could do

    $html_entity = '&#x'.unpack("H*", encode("ucs2be", $ch)).';';

    Alternatively, if you declare the HTML page's encoding as content="text/html; charset=utf8" you can pass through the string as it is (first make sure it is in UTF-8, of course). Also make sure the corresponding filehandle is opened as utf8.
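
    A minimal sketch of that pass-through variant (the file name and the assumption that $kana already holds a decoded Perl string are just for illustration):

    # assume $kana is already a decoded Perl string (e.g. from Encode::decode)
    my $kana = "\x{3042}";   # hiragana 'a', as a stand-in

    # open the output with the :utf8 layer and declare the charset in the page
    open my $fh, '>:utf8', 'vocab.html' or die "Cannot open vocab.html: $!";
    print $fh qq{<html>\n<head>\n},
              qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n},
              qq{</head>\n<body>\n<p>$kana</p>\n</body>\n</html>\n};
    close $fh;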

    Cheers,
    Almut

    BTW, get rid of that $i++ in your code :) -- it is useless at best. Actually, it's responsible for that weird 1 in your "My ustring is now 1" (bonus points if you figure out why). The other weird 1s (at the end of "ascii1") are due to getcode() returning _two_ values in list context: the encoding, and the number of characters...
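
    In other words (purely illustrative):

    use Jcode;

    my $str = "just some text";            # any string will do for the demo
    my ($code, $nmatch) = getcode($str);   # list context: encoding AND match count
    print "encoding: $code\n";

    # whereas  print(getcode($str), "\n")  prints both values back to back,
    # which is where the stray '1' after 'ascii' came from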

      Hi Almut,

      Thank you very much for your reply. I decided to take the obscurity of the terminal window out of the picture and am now trying to generate it as a html page to view in Firefox.

      This is all being done on Windows XP Professional.

      At the moment I can do this with output to Firefox:

      1) Using ASP and MS Access - OK!
      2) Using ASP and MySQL - question marks.
      3) Using Perl and MS Access - question marks.
      4) Using Perl and MySQL - question marks.

      So naturally I would like to do this with Perl and MySQL. When looking at the data in MySQL with a GUI like NaviCat I can see all the Japanese characters, no worries. In the MySQL my.ini file I have, among others, the following settings:

      # The default character set that will be used when a new schema or table is created and no character set is defined
      default-character-set=utf8

      As part of the client section I have:

      [mysql]

      default-character-set=utf8

      Also using something like 'MySQL Query Browser' I can again see the Japanese characters when I perform a simple query such as "SELECT kana, kanji FROM vocab".

      I must admit a lot of what you said went straight over my head and I didn't really understand it.

      I am sorry to say but I have no idea how to do this:
      "To do so, let your perl program create HTML entity representations of the unicode characters, and embed those into some HTML page. For the purpose at hand, the ' ODEPOINT-IN-HEX;' form is easiest to generate."

      Here is the code I am currently using:

      use DBI;
      use CGI qw/:standard :html3/;
      use CGI::Carp qw(fatalsToBrowser);

      my $query = CGI->new;

      print header(-charset=>'utf-8');

      my $dbh = DBI->connect('DBI:ODBC:japmysql','','')
          or die "Cannot connect: $DBI::errstr\n";
      my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
          or die("Couldn't prepare");
      $sth->execute or die ("Could not execute SQL statement.");

      if($sth->rows == 0) {
          print "No information.";
      }
      else {
          print "<table border=1>\n";
          while( $resptr = $sth->fetchrow_hashref() ) {
              print "<tr>";
              print "<td>" . $resptr->{"English"};
              print "<td>" . $resptr->{"Kana"};
              print "<td>" . $resptr->{"Kanji"};
              print "\n";
          }
          print "</table>\n";
      }
      print "</body></html>\n";

      $dbh->disconnect;

      Can you please show me what you mean using very short words? :)

      Thanks again for all your help!

        Presuming the data you get from mysql actually is in UTF-8, you could try two alternatives:

        (1) - The straightforward approach: add the line binmode STDOUT, ":utf8"; (e.g. before the line print header(-charset=>'utf-8');)

        This might already suffice, as it's supposed to make sure that the CGI output is being written as UTF-8 (theoretically, there could still be a few other pitfalls, though...)
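
        In context, the top of your script would then start out roughly like this (everything else unchanged):

        use DBI;
        use CGI qw/:standard :html3/;
        use CGI::Carp qw(fatalsToBrowser);

        my $query = CGI->new;

        binmode STDOUT, ":utf8";            # write the output as UTF-8 bytes...
        print header(-charset=>'utf-8');    # ...and tell the browser that's what it gets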

        (2) - The safe approach: add a little conversion function

        use Encode "encode"; sub U2Entity { return '&#x'.unpack("H*", encode("ucs2be", shift)).';'; }

        and then modify the lines where you output the non-english table cells to read: print "<td>" . U2Entity( $resptr->{"Kana"} ); , etc.

        In this case you no longer need to declare -charset=>'utf-8', as now everything is output in plain ASCII -- for example the Hiragana 'a' would be converted to its HTML entity representation &#x3042;

        If both of these approaches do not work (which might well be the case), then the data from mysql most likely isn't in UTF-8. To figure out what you're actually being passed from the mysql side, try the following:

        Add another little function

        sub Any2Hex { return unpack("H*", shift); }

        and then use that one instead of U2Entity(), i.e. print "<td>" . Any2Hex( $resptr->{"Kana"} );

        and report back which hex values you're getting... :)

        (In case you're merely seeing 3f (the hex value of the ASCII code representing the question mark character), then the problem is most likely the mysql setup... --> How did you create your mysql tables, etc...?)
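
        (For that, something along these lines -- untested via ODBC, and reusing your 'japmysql' DSN -- would show what the server thinks it is using:)

        use DBI;

        my $dbh = DBI->connect('DBI:ODBC:japmysql','','')
            or die "Cannot connect: $DBI::errstr\n";

        # the server/client/connection character set settings
        my $vars = $dbh->selectall_arrayref("SHOW VARIABLES LIKE 'character_set%'");
        printf "%-30s %s\n", @$_ for @$vars;

        # and the declaration of the table itself
        my ($name, $ddl) = $dbh->selectrow_array('SHOW CREATE TABLE Vocab');
        print "$ddl\n";

        $dbh->disconnect;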

        Cheers,
        Almut

        BTW, just as an aside: to create a proper, complete HTML file, you'll probably also want to output the <html> and <body> tags...

Re: MS Access Input -> Japanese Output
by graff (Chancellor) on Nov 13, 2006 at 02:44 UTC
    I am trying to take some Japanese vocabulary from a MS Access file and then first print it as output to the screen but eventually I want to put it on a website.

    You need to know which character encoding is being used for Japanese in the MS Access file. (I wouldn't really know; cp932 seems likely, or shiftjis may also work, but you should try to confirm that somehow. Check out Encode::Guess.)
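
    For example, something like this (the suspect list is just a guess; ascii, utf8 and UTF-16/32 with a BOM are checked by default):

    use Encode::Guess;

    # $bytes would be the raw value fetched from Access; stand-in bytes here
    my $bytes = "\x82\xa0";   # hiragana 'a' in cp932, for illustration

    my $enc = guess_encoding($bytes, qw/cp932 euc-jp/);
    if ( ref $enc ) {
        print "looks like ", $enc->name, "\n";
    }
    else {
        print "could not guess: $enc\n";   # on failure $enc is an error string
    }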

    And what sort of "screen" are you talking about? Is it an app that has the appropriate fonts and can correctly display the Japanese text data from Access? If so, it is presumably using the same character encoding that Access is using, and maybe you just want to preserve that encoding, even when putting the data onto a web page.

    Preserving the existing encoding is easy enough -- just don't do anything but fetch the data and pass it along as-is. If you have reasons for converting it to unicode, utf8 is the best encoding for that (it's what perl uses internally, so you start with conversion to utf8 anyway). Note that you need a utf8-capable display to view such data. (It sounds like you have such a display tool already, since you mentioned seeing "question marks" where you expected Hiragana and Kanji -- that's what you get when a utf8-based display is given non-utf8 data.)

    You would want to convert to utf8 if you intend to do regex matching, and/or substitutions, and/or any sort of character-based (rather than byte-based) manipulation on strings. Doing this sort of thing on non-unicode Japanese text is a risky business at best -- it's possible (and not that hard) to corrupt the data beyond recognition or repair.

    "ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)

    Anyway, let's assume that the Access database has stuff in cp932. Here's how you'd adjust the OP code to output the data as utf8:

    use DBI;
    use Encode;

    binmode STDOUT, ":utf8";   # this will avoid warnings on output

    my $dbh = DBI->connect('DBI:ODBC:japan','','')
        or die "Cannot connect: $DBI::errstr\n";
    my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
        or die "Cannot prepare: $DBI::errstr\n";
    $sth->execute or die "Cannot execute: $DBI::errstr\n";

    my $rownum = 1;
    while( my ($eng,$kana,$kanji) = $sth->fetchrow_array() ) {
        # $eng is presumably ASCII already -- no conversion needed
        $_ = decode( 'cp932', $_ ) for ( $kana, $kanji );
        printf( "%d:\t%s\t%s\t%s\n", $rownum++, $eng, $kana, $kanji );
    }
    $dbh->disconnect;
    (not tested, but should be close to what you need)
      Thanks for your help. But this doesn't work. At this stage I am outputting to the debug output terminal within Komodo. I still get question marks for hiragana and kanji.
      "ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)
      I got this as output from my program.
      foreach $i (@row) { print(getcode($i), "\n"); $i++; }
      But at this point it feels like I am testing the data encoding too late. It has already been parsed (?) by Perl and put into a Perl array.

      Somewhere between:

      $sth->execute or die "Cannot execute: $DBI::errstr\n";

      and

      @row = $sth->fetchrow_array();
      is where I should be testing the encoding of my data, shouldn't I? I don't know how to do this part of the program.

      There is a great article at:
      http://ahinea.com/en/tech/perl-unicode-struggle.html

      But again I was unable to adjust the information in it to suit my needs. The terminal can output the hiragana/kanji if it is already in UTF-8 encoding but I just can't get it into that encoding after I take it from MS Access.

      Please a little more help.

        I got this as output from my program.
        foreach $i (@row) { print(getcode($i), "\n"); $i++; }

        Ah. Sorry, I should have pointed out earlier that there is a problem with that loop. You need to study Perl syntax a little more...

        When you say for $i ( @row ) (or "foreach"), $i is being set to each successive value of @row on each iteration -- in other words, $i is not an array index, it is the value stored at each element of the array. So do not increment $i in that sort of situation, because it makes no sense to do that. (That's probably where the "1" is coming from.)
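
        To make that concrete, here are two ways to write that loop that do make sense (with stand-in data so the snippet runs on its own):

        use Jcode;

        my @row = ( 'Ah!', 'kana bytes here', 'kanji bytes here' );  # stand-in data

        # loop over the values -- $value already IS each element, no counter needed
        for my $value (@row) {
            my ($code) = getcode($value);
            print "$code\n";
        }

        # or, if you really want an index, loop over the indices instead
        for my $i ( 0 .. $#row ) {
            my ($code) = getcode( $row[$i] );
            print "field $i: $code\n";
        }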

        So on the first iteration through that loop, you are looking at the English field, which is presumably ascii data. You still need to figure out what encoding is being used in the latter two fields (Kana and Kanji). I gather that the "getcode" method in Jcode is supposed to return the encoding -- here's what the documentation says:

               ($code, $nmatch) = getcode($str)
                 Returns char code of $str. Return codes are as follows
        
                  ascii   Ascii (Contains no Japanese Code)
                  binary  Binary (Not Text File)
                  euc     EUC-JP
                  sjis    SHIFT_JIS
                  jis     JIS (ISO-2022-JP)
                  ucs2    UCS2 (Raw Unicode)
                  utf8    UTF8
        
        So this method should tell you what you need to know. I'll try again with a snippet suggestion:
        binmode STDOUT, ":utf8";

        # connect and run your query on Access db... then:
        my @row   = $sth->fetchrow_array;
        my $eng   = shift @row;   # first field is English
        my $kana  = shift @row;   # second field is Kana
        my $kanji = shift @row;   # third field is Kanji

        my $kana_enc  = getcode( $kana );
        my $kanji_enc = getcode( $kanji );
        if ( $kana_enc ne $kanji_enc ) {
            warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
        }

        my $kana_utf8  = decode( $kana_enc,  $kana );
        my $kanji_utf8 = decode( $kanji_enc, $kanji );
        printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

        You just said "this doesn't work"... You have to be more explicit. Show the actual code you used, including the modifications you made according to my suggestions (so I can see whether you actually did as I intended), and give some sort of definition for "doesn't work", in the sense of "I expected this: ... but got this: ..." -- that is, try to show some actual data.

        (Saving the output to a file and viewing that with any sort of tool that shows byte-by-byte hex codes can be very helpful. On unix/linux and unix-tools-ported-to-windows, there's the "od" command, and just running "od -txC data.file" would do nicely.)
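
        (If you don't have od on Windows, a few lines of Perl will do much the same job:)

        # rough stand-in for "od -tx1" -- dump a file's raw bytes as hex
        open my $fh, '<:raw', 'data.file' or die "Cannot open data.file: $!";
        local $/;                           # slurp the whole file at once
        print unpack("H*", <$fh>), "\n";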

        Please, a little more information about what you are dealing with, and what you've done with my earlier suggestion.

        UPDATE: I just noticed that the strings returned by Jcode::getcode() might not work when passed to Encode::decode. You may need to add a hash that maps the Jcode strings to valid Encode designations:

        my %code_map = (
            euc  => 'euc-jp',
            sjis => 'shiftjis',
            jis  => 'iso-2022-jp',
            ucs2 => 'UCS-2LE',
            utf8 => 'utf8',
        );
        # ...
        my $kana_enc = getcode( $kana );
        # ...
        $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
        # ...