in reply to MS Access Input -> Japanese Output

I am trying to take some Japanese vocabulary from a MS Access file and then first print it as output to the screen but eventually I want to put it on a website.

You need to know which character encoding is being used for Japanese in the MS Access file. (I wouldn't really know; cp932 seems likely, or shiftjis may also work, but you should try to confirm that somehow. Check out Encode::Guess.)
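As a rough sketch of how Encode::Guess could be used for that check: the sample bytes below are made up (they are the Shift-JIS bytes for one hiragana character), and the suspect list is only an assumption about what Access might hand back. In real use you would pass in a field fetched from the database.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical sample: the two bytes of HIRAGANA LETTER A in Shift-JIS.
# Replace this with a value fetched from the database.
my $bytes = "\x82\xA0";

# The suspect list is an assumption; adjust it to the encodings you consider likely.
my $guess = guess_encoding( $bytes, qw/euc-jp shiftjis 7bit-jis/ );

if ( ref $guess ) {
    # Success: we get back an encoding object
    printf "Looks like: %s\n", $guess->name;
}
else {
    # Failure or ambiguity: we get back a plain error string
    print "Could not decide: $guess\n";
}
```

Note that cp932 and shiftjis accept mostly the same byte sequences, so listing both as suspects will usually produce a "too ambiguous" message rather than an answer.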

And what sort of "screen" are you talking about? Is it an app that has the appropriate fonts and can correctly display the Japanese text data from Access? If so, it is presumably using the same character encoding that Access is using, and maybe you just want to preserve that encoding, even when putting the data onto a web page.

Preserving the existing encoding is easy enough -- just don't do anything but fetch the data and pass it along as-is. If you have reasons for converting it to unicode, utf8 is the best encoding for that (it's what perl uses internally, so you start with conversion to utf8 anyway). Note that you need a utf8-capable display to view such data. (It sounds like you have such a display tool already, since you mentioned seeing "question marks" where you expected Hiragana and Kanji -- that's what you get when a utf8-based display is given non-utf8 data.)
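A minimal sketch of the "pass it along as-is" approach for a web page, assuming the data really is Shift-JIS/cp932. The helper name page_for_bytes and the charset label are my own assumptions, and the sketch does no HTML escaping, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap already-encoded bytes in a minimal CGI response
# without decoding them. The charset label must match the real encoding.
sub page_for_bytes {
    my ( $bytes, $charset ) = @_;
    return "Content-Type: text/html; charset=$charset\r\n\r\n"
         . "<html><body><p>$bytes</p></body></html>\n";
}

binmode STDOUT;    # raw bytes out, no I/O layer that could re-encode them

# "\x82\xA0" stands in for a field fetched from the database
print page_for_bytes( "\x82\xA0", 'Shift_JIS' );
```

The point is that nothing touches the bytes between the fetch and the print; the browser does the interpreting, based on the declared charset.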

You would want to convert to utf8 if you intend to do regex matching, and/or substitutions, and/or any sort of character-based (rather than byte-based) manipulation on strings. Doing this sort of thing on non-unicode Japanese text is a risky business at best -- it's possible (and not that hard) to corrupt the data beyond recognition or repair.
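To illustrate the risk with one concrete case: in Shift-JIS, the second byte of some two-byte characters is 0x5C, which is also the ASCII backslash. The sketch below uses the katakana letter "so" (U+30BD) as the sample; a byte-based regex sees a backslash that is not really there.

```perl
use strict;
use warnings;
use Encode;

my $chars = "\x{30BD}";                     # KATAKANA LETTER SO, as a character string
my $bytes = encode( 'shiftjis', $chars );   # two bytes: 0x83 0x5C

# Byte-based match: the second byte happens to be the ASCII backslash
my $byte_hit = $bytes =~ /\\/ ? 1 : 0;

# Character-based match: the text contains no backslash at all
my $char_hit = $chars =~ /\\/ ? 1 : 0;

print "bytes match backslash: $byte_hit\n";
print "chars match backslash: $char_hit\n";
```

A substitution such as s/\\//g on the undecoded bytes would strip half of that character and leave garbage behind.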

"ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)

Anyway, let's assume that the Access database has stuff in cp932. Here's how you'd adjust the OP code to output the data as utf8:

use DBI;
use Encode;

binmode STDOUT, ":utf8";  # this will avoid warnings on output

my $dbh = DBI->connect('DBI:ODBC:japan','','')
    or die "Cannot connect: $DBI::errstr\n";
my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
    or die "Cannot prepare: $DBI::errstr\n";
$sth->execute
    or die "Cannot execute: $DBI::errstr\n";

my $rownum = 1;
while( my ($eng,$kana,$kanji) = $sth->fetchrow_array() )
{
    # $eng is presumably ASCII already -- no conversion needed
    $_ = decode( 'cp932', $_ ) for ( $kana, $kanji );
    printf( "%d:\t%s\t%s\t%s\n", $rownum++, $eng, $kana, $kanji );
}
$dbh->disconnect;
(not tested, but should be close to what you need)

Re^2: MS Access Input -> Japanese Output
by Zettai (Acolyte) on Nov 13, 2006 at 03:20 UTC
    Thanks for your help. But this doesn't work. At this stage I am outputting to the debug output terminal within Komodo. I still get question marks for hiragana and kanji.
    "ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)
    I got this as output from my program.
    foreach $i (@row) { print(getcode($i), "\n"); $i++; }
    But at this point it feels like I am testing the data encoding too late. It has already been parsed? by perl and put into a Perl array.

    Somewhere between:

    $sth->execute or die "Cannot execute: $DBI::errstr\n";

    and

    @row = $sth->fetchrow_array();
    is where I should be testing the encoding of my data shouldn't I? I don't know how to do this part of the program.

    There is a great article at:
    http://ahinea.com/en/tech/perl-unicode-struggle.html

    But again I was unable to adjust the information in it to suit my needs. The terminal can output the hiragana/kanji if it is already in UTF-8 encoding but I just can't get it into that encoding after I take it from MS Access.

    Please a little more help.

      I got this as output from my program.
      foreach $i (@row) { print(getcode($i), "\n"); $i++; }

      Ah. Sorry, I should have pointed out earlier that there is a problem with that loop. You need to study Perl syntax a little more...

      When you say for $i ( @row ) (or "foreach"), $i is being set to each successive value of @row on each iteration -- in other words, $i is not an array index, it is the value stored at each element of the array. So do not increment $i in that sort of situation, because it makes no sense to do that. (That's probably where the "1" is coming from.)
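Here is a small sketch of the difference, using made-up field values. It also shows that the loop variable is an alias for the array element, so modifying it modifies the array itself.

```perl
use strict;
use warnings;

my @row = ( 'Ah!', 'kana-field', 'kanji-field' );    # made-up values

# The loop variable holds each element in turn, not its position
for my $value (@row) {
    print "value: $value\n";
}

# If you want positions, loop over the index range instead
for my $i ( 0 .. $#row ) {
    print "field $i: $row[$i]\n";
}

# The loop variable is an alias: changing it changes the array element
my @copy = @row;
$_ .= '?' for @copy;
print "first element of copy is now: $copy[0]\n";
```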

      So on the first iteration through that loop, you are looking at the English field, which is presumably ascii data. You still need to figure out what encoding is being used in the latter two fields (Kana and Kanji). I gather that the "getcode" method in Jcode is supposed to return the encoding -- here's what the documentation says:

             ($code, $nmatch) = getcode($str)
               Returns char code of $str. Return codes are as follows
      
                ascii   Ascii (Contains no Japanese Code)
                binary  Binary (Not Text File)
                euc     EUC-JP
                sjis    SHIFT_JIS
                jis     JIS (ISO-2022-JP)
                ucs2    UCS2 (Raw Unicode)
                utf8    UTF8
      
      So this method should tell you what you need to know. I'll try again with a snippet suggestion:
      binmode STDOUT, ":utf8";

      # connect and run your query on Access db... then:

      my @row = $sth->fetchrow_array;
      my $eng = shift @row;    # first field is English
      my $kana = shift @row;   # second field is Kana
      my $kanji = shift @row;  # third field is Kanji

      my $kana_enc = getcode( $kana );
      my $kanji_enc = getcode( $kanji );
      if ( $kana_enc ne $kanji_enc ) {
          warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
      }
      my $kana_utf8 = decode( $kana_enc, $kana );
      my $kanji_utf8 = decode( $kanji_enc, $kanji );

      printf( "English: %s Kana: %s Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

      You just said "this doesn't work"... You have to be more explicit. Show the actual code you used, including the modifications you made according to my suggestions (so I can see whether you actually did as I intended), and give some sort of definition for "doesn't work", in the sense of "I expected this: ... but got this: ..." -- that is, try to show some actual data.

      (Saving the output to a file and viewing that with any sort of tool that shows byte-by-byte hex codes can be very helpful. On unix/linux and unix-tools-ported-to-windows, there's the "od" command, and just running "od -txC data.file" would do nicely.)
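If "od" is not handy, a few lines of Perl can produce the same kind of byte-by-byte view. This is only a sketch; hex_bytes is a helper name I made up, and the sample character is HIRAGANA LETTER A.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: space-separated hex codes for each byte of a byte string
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02x', ord $_ ) } split //, $bytes;
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A

print 'utf8:  ', hex_bytes( encode( 'UTF-8', $hiragana_a ) ), "\n";    # e3 81 82
print 'cp932: ', hex_bytes( encode( 'cp932', $hiragana_a ) ), "\n";    # 82 a0
print 'ascii: ', hex_bytes( '?' ), "\n";                               # 3f
```

Comparing a dump of your own data against patterns like these shows whether a field holds utf8 bytes, cp932 bytes, or a literal question mark.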

      Please, a little more information about what you are dealing with, and what you've done with my earlier suggestion.

      UPDATE: I just noticed that the strings returned by Jcode::getcode() might not work when passed to Encode::decode. You may need to add a hash that maps the Jcode strings to valid Encode designations:

      my %code_map = ( euc  => 'euc-jp',
                       sjis => 'shiftjis',
                       jis  => 'iso-2022-jp',
                       ucs2 => 'UCS-2LE',
                       utf8 => 'utf8'
                     );
      # ...
      my $kana_enc = getcode( $kana );
      # ...
      $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
      # ...
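To make the mapping idea testable without a database, here is a self-contained sketch. The helper name decode_field and the extra "ascii" entry are my own additions, and the sample bytes are hard-coded where a real program would use getcode() on a fetched field.

```perl
use strict;
use warnings;
use Encode;

# Jcode result name => Encode name ("ascii" entry is an addition, not in the map above)
my %code_map = (
    ascii => 'ascii',
    euc   => 'euc-jp',
    sjis  => 'shiftjis',
    jis   => 'iso-2022-jp',
    ucs2  => 'UCS-2LE',
    utf8  => 'utf8',
);

# Hypothetical helper: decode bytes using the name that getcode() reported
sub decode_field {
    my ( $jcode_name, $bytes ) = @_;
    my $enc_name = $code_map{$jcode_name}
        or die "No Encode name known for Jcode result '$jcode_name'\n";
    return decode( $enc_name, $bytes );
}

# Made-up sample: pretend getcode() returned 'sjis' for these two bytes
my $text = decode_field( 'sjis', "\x82\xA0" );
printf "decoded to U+%04X\n", ord $text;
```

Dying on an unmapped name (such as "binary") is a deliberate choice here, so that an unexpected getcode() result is noticed instead of being passed to decode().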
        Thanks again for the help. Apologies for the previous obscure "...this doesn't work." comment.

        Just so you know the environment I am using:
        WinXP Professional Version 2002, Service Pack 2
        Komodo Professional 3.5.3
        perl, v5.8.8 built for MSWin32-x86-multi-thread

        There is a lot more info if I use "perl -V" in a windows command prompt but not sure you want all that. Tell me if you do.

        So this time I used:

        binmode STDOUT, ":utf8";

        use DBI;
        use Encode;
        use Jcode;

        my $dbh = DBI->connect('DBI:ODBC:japan','','')
            or die "Cannot connect: $DBI::errstr\n";
        my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
            or die "Cannot prepare: $DBI::errstr\n";
        $sth->execute
            or die "Cannot execute: $DBI::errstr\n";

        my %code_map = ( euc  => 'euc-jp',
                         sjis => 'shiftjis',
                         jis  => 'iso-2022-jp',
                         ucs2 => 'UCS-2LE',
                         utf8 => 'utf8'
                       );

        my @row = $sth->fetchrow_array;
        my $eng = shift @row;    # first field is English
        my $kana = shift @row;   # second field is Kana
        my $kanji = shift @row;  # third field is Kanji

        my $kana_enc = getcode( $kana );
        my $kanji_enc = getcode( $kanji );
        if ( $kana_enc ne $kanji_enc ) {
            warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
        }
        my $kana_utf8 = decode( $kana_enc, $kana );
        my $kanji_utf8 = decode( $kanji_enc, $kanji );

        printf( "English: %s Kana: %s Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

        I then changed to the directory where I have the program and ran it as:

        perl reply2.pl > output.txt

        Output:
        ------

        English: Ah! Kana: ? Kanji: NA

        So unfortunately there is still a question mark for anything in hiragana/kanji.

        I have cygwin installed as well so I used the 'od' command you suggested as per:

        od -txC output.txt

        Output:
        -------
        0000000 45 6e 67 6c 69 73 68 3a 20 41 68 21 20 20 4b 61
        0000020 6e 61 3a 20 3f 20 20 4b 61 6e 6a 69 3a 20 4e 41
        0000040 0d 0a
        0000042

        I don't know what all these hex values mean. What do you think?