in reply to Re^2: MS Access Input -> Japanese Output
in thread MS Access Input -> Japanese Output

I got this as output from my program.
    foreach $i (@row) { print(getcode($i), "\n"); $i++; }

Ah. Sorry, I should have pointed out earlier that there is a problem with that loop. You need to study Perl syntax a little more...

When you say for $i ( @row ) (or "foreach"), $i is being set to each successive value of @row on each iteration -- in other words, $i is not an array index, it is the value stored at each element of the array. So do not increment $i in that sort of situation, because it makes no sense to do that. (That's probably where the "1" is coming from.)
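To make the distinction concrete, here is a minimal sketch contrasting a loop over values with a loop over indices (the sample data is just an illustration):

```perl
use strict;
use warnings;

my @row = ( 'Ah!', 'kana-field', 'kanji-field' );

# foreach aliases $val to each VALUE of @row in turn -- it is not an index,
# so incrementing it makes no sense:
foreach my $val ( @row ) {
    print "value: $val\n";
}

# If you actually want indices, loop over the index range instead:
foreach my $i ( 0 .. $#row ) {
    print "index $i holds: $row[$i]\n";
}
```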

So on the first iteration through that loop, you are looking at the English field, which is presumably ascii data. You still need to figure out what encoding is being used in the latter two fields (Kana and Kanji). I gather that the "getcode" method in Jcode is supposed to return the encoding -- here's what the documentation says:

       ($code, $nmatch) = getcode($str)
         Returns char code of $str. Return codes are as follows

          ascii   Ascii (Contains no Japanese Code)
          binary  Binary (Not Text File)
          euc     EUC-JP
          sjis    SHIFT_JIS
          jis     JIS (ISO-2022-JP)
          ucs2    UCS2 (Raw Unicode)
          utf8    UTF8
So this method should tell you what you need to know. I'll try again with a snippet suggestion:
    binmode STDOUT, ":utf8";
    # connect and run your query on the Access db... then:
    my @row = $sth->fetchrow_array;
    my $eng   = shift @row;   # first field is English
    my $kana  = shift @row;   # second field is Kana
    my $kanji = shift @row;   # third field is Kanji
    my $kana_enc  = getcode( $kana );
    my $kanji_enc = getcode( $kanji );
    if ( $kana_enc ne $kanji_enc ) {
        warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
    }
    my $kana_utf8  = decode( $kana_enc,  $kana );
    my $kanji_utf8 = decode( $kanji_enc, $kanji );
    printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

You just said "this doesn't work"... You have to be more explicit. Show the actual code you used, including the modifications you made according to my suggestions (so I can see whether you actually did as I intended), and give some sort of definition for "doesn't work", in the sense of "I expected this: ... but got this: ..." -- that is, try to show some actual data.

(Saving the output to a file and viewing that with any sort of tool that shows byte-by-byte hex codes can be very helpful. On unix/linux and unix-tools-ported-to-windows, there's the "od" command, and just running "od -txC data.file" would do nicely.)
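(If you don't have "od" handy, the same byte-by-byte view can be produced with a few lines of Perl. This is just a sketch; it writes a small sample file so it runs as-is, but in practice you would point it at your own output file.)

```perl
use strict;
use warnings;

# Write a tiny sample file so this example is self-contained;
# replace 'sample.txt' with your own output file in real use.
open my $out, '>:raw', 'sample.txt' or die "Cannot write: $!";
print $out "English: Ah!";
close $out;

# Slurp the file in raw (byte) mode -- no encoding layer applied.
open my $fh, '<:raw', 'sample.txt' or die "Cannot open: $!";
local $/;
my $data = <$fh>;
close $fh;

# Print 16 bytes per line as two-digit hex codes, like "od -txC".
my @bytes = unpack 'C*', $data;
while ( my @chunk = splice @bytes, 0, 16 ) {
    print join( ' ', map { sprintf '%02x', $_ } @chunk ), "\n";
}
```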

Please, a little more information about what you are dealing with, and what you've done with my earlier suggestion.

UPDATE: I just noticed that the strings returned by Jcode::getcode() might not work when passed to Encode::decode. You may need to add a hash that maps the Jcode strings to valid Encode designations:

    my %code_map = (
        euc  => 'euc-jp',
        sjis => 'shiftjis',
        jis  => 'iso-2022-jp',
        ucs2 => 'UCS-2LE',
        utf8 => 'utf8',
    );
    # ...
    my $kana_enc = getcode( $kana );
    # ...
    $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
    # ...

Replies are listed 'Best First'.
Re^4: MS Access Input -> Japanese Output
by Zettai (Acolyte) on Nov 13, 2006 at 12:46 UTC
    Thanks again for the help. Apologies for the previous obscure "...this doesn't work." comment.

    Just so you know the environment I am using:
    WinXP Professional Version 2002, Service Pack 2
    Komodo Professional 3.5.3
    perl, v5.8.8 built for MSWin32-x86-multi-thread

    There is a lot more info if I use "perl -V" in a Windows command prompt, but I'm not sure you want all that. Tell me if you do.

    So this time I used:

    binmode STDOUT, ":utf8";
    use DBI;
    use Encode;
    use Jcode;
    my $dbh = DBI->connect('DBI:ODBC:japan','','')
        or die "Cannot connect: $DBI::errstr\n";
    my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
        or die "Cannot prepare: $DBI::errstr\n";
    $sth->execute or die "Cannot execute: $DBI::errstr\n";
    my %code_map = (
        euc  => 'euc-jp',
        sjis => 'shiftjis',
        jis  => 'iso-2022-jp',
        ucs2 => 'UCS-2LE',
        utf8 => 'utf8',
    );
    my @row = $sth->fetchrow_array;
    my $eng   = shift @row;   # first field is English
    my $kana  = shift @row;   # second field is Kana
    my $kanji = shift @row;   # third field is Kanji
    my $kana_enc  = getcode( $kana );
    my $kanji_enc = getcode( $kanji );
    if ( $kana_enc ne $kanji_enc ) {
        warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
    }
    my $kana_utf8  = decode( $kana_enc,  $kana );
    my $kanji_utf8 = decode( $kanji_enc, $kanji );
    printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

    I then changed to the directory where I have the program and ran it as:

    perl reply2.pl > output.txt

    Output:
    ------

    English: Ah! Kana: ? Kanji: NA

    So unfortunately there is still a question mark for anything in hiragana/kanji.

    I have cygwin installed as well so I used the 'od' command you suggested as per:

    od -txC output.txt

    Output:
    -------
    0000000 45 6e 67 6c 69 73 68 3a 20 41 68 21 20 20 4b 61
    0000020 6e 61 3a 20 3f 20 20 4b 61 6e 6a 69 3a 20 4e 41
    0000040 0d 0a
    0000042

    I don't know what all these hex values mean. What do you think?

      Sorry for hijacking your thread, graff, but I think the problem lies in the inner workings of Jcode's getcode() function, which fails to identify UCS-2 under certain circumstances. An example:

      use Encode qw(encode);
      use Jcode;   # exports getcode()

      my $a = "\x{3042}";   # Hiragana 'a'
      show_info($a);        # UTF-8

      my $a_cp932 = encode("cp932", $a);
      show_info($a_cp932);

      my $a_ucs2le = encode("ucs2le", $a);
      show_info($a_ucs2le);

      my $a_ucs2be = encode("ucs2be", $a);
      show_info($a_ucs2be);

      sub show_info {
          my $s   = shift;
          my $hex = unpack("H*", $s);
          my $enc = getcode($s);
          print "hex = $hex\n";
          print "enc = $enc\n\n";
      }

      This prints (comments added)

      hex = e38182
      enc = utf8    # OK

      hex = 82a0
      enc = sjis    # OK

      hex = 4230
      enc = ascii   # wrong

      hex = 3042
      enc = ascii   # wrong

      As we can see, the latter two UCS-2 strings are incorrectly identified as "ascii"...

      Well, if you think about it, how should the function's heuristics tell apart the single-char UCS-2 strings from their regular two-char ASCII interpretations (i.e. "0B" == "\x30\x42" or "B0" == "\x42\x30")?
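      To make the ambiguity concrete, here is a small sketch showing that the very same two bytes are simultaneously a valid two-character ASCII string and a valid one-character UCS-2BE string, so no byte-level heuristic can decide between them:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\x30\x42";    # the two bytes in question

# Interpreted as ASCII, they are simply the characters '0' and 'B':
print "as ascii:   $bytes\n";                     # prints "0B"

# Interpreted as UCS-2BE, they are the single code point U+3042
# (hiragana 'a'):
my $as_ucs2 = decode( 'UCS-2BE', $bytes );
printf "as ucs-2be: U+%04X\n", ord $as_ucs2;      # prints "U+3042"
```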

      Personally, I'd just look at the raw byte sequences. Sometimes, "less is more" ;)