in reply to Re: MS Access Input -> Japanese Output
in thread MS Access Input -> Japanese Output

Thanks for your help, but this doesn't work. At this stage I am outputting to the debug output terminal within Komodo, and I still get question marks for the hiragana and kanji.
"ascii1" is not a valid designation for any sort of character encoding. (How did you come up with that?)
I got this as output from my program.
foreach $i (@row) { print(getcode($i), "\n"); $i++; }
But at this point it feels like I am testing the data encoding too late; the data has already been parsed by Perl and put into a Perl array.

Somewhere between:

$sth->execute or die "Cannot execute: $DBI::errstr\n";

and

@row = $sth->fetchrow_array();
is where I should be testing the encoding of my data, isn't it? I don't know how to write that part of the program.

There is a great article at:
http://ahinea.com/en/tech/perl-unicode-struggle.html

But again I was unable to adapt the information in it to my needs. The terminal can output the hiragana/kanji if the text is already in UTF-8, but I just can't get it into that encoding after pulling it out of MS Access.

Please a little more help.

Replies are listed 'Best First'.
Re^3: MS Access Input -> Japanese Output
by graff (Chancellor) on Nov 13, 2006 at 03:58 UTC
    I got this as output from my program.
    foreach $i (@row) { print(getcode($i), "\n"); $i++; }

    Ah. Sorry, I should have pointed out earlier that there is a problem with that loop. You need to study Perl syntax a little more...

    When you say for $i ( @row ) (or "foreach"), $i is being set to each successive value of @row on each iteration -- in other words, $i is not an array index, it is the value stored at each element of the array. So do not increment $i in that sort of situation, because it makes no sense to do that. (That's probably where the "1" is coming from.)
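    To make that concrete, here is a minimal sketch (placeholder values in @row, and assuming Jcode's getcode is imported as in your script) showing the difference between value aliasing and index-based looping:

        my @row = ( 'Ah!', 'kana bytes here', 'kanji bytes here' );   # placeholder values

        # foreach aliases $i to each VALUE in turn -- no counter needed
        foreach my $i ( @row ) {
            my $code = getcode( $i );
            print "$code\n";
        }

        # if you really want an index, loop over the index range instead
        foreach my $n ( 0 .. $#row ) {
            my $code = getcode( $row[$n] );
            printf "field %d: %s\n", $n, $code;
        }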

    So on the first iteration through that loop, you are looking at the English field, which is presumably ascii data. You still need to figure out what encoding is being used in the latter two fields (Kana and Kanji). I gather that the "getcode" method in Jcode is supposed to return the encoding -- here's what the documentation says:

           ($code, $nmatch) = getcode($str)
             Returns char code of $str. Return codes are as follows
    
              ascii   Ascii (Contains no Japanese Code)
              binary  Binary (Not Text File)
              euc     EUC-JP
              sjis    SHIFT_JIS
              jis     JIS (ISO-2022-JP)
              ucs2    UCS2 (Raw Unicode)
              utf8    UTF8
    
    So this method should tell you what you need to know. I'll try again with a snippet suggestion:
    binmode STDOUT, ":utf8";

    # (assumes "use Jcode;" and "use Encode;" are already in effect)
    # connect and run your query on Access db... then:
    my @row   = $sth->fetchrow_array;
    my $eng   = shift @row;   # first field is English
    my $kana  = shift @row;   # second field is Kana
    my $kanji = shift @row;   # third field is Kanji

    my $kana_enc  = getcode( $kana );
    my $kanji_enc = getcode( $kanji );
    if ( $kana_enc ne $kanji_enc ) {
        warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
    }

    my $kana_utf8  = decode( $kana_enc,  $kana );
    my $kanji_utf8 = decode( $kanji_enc, $kanji );
    printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

    You just said "this doesn't work"... You have to be more explicit. Show the actual code you used, including the modifications you made according to my suggestions (so I can see whether you actually did as I intended), and give some sort of definition for "doesn't work", in the sense of "I expected this: ... but got this: ..." -- that is, try to show some actual data.

    (Saving the output to a file and viewing that with any sort of tool that shows byte-by-byte hex codes can be very helpful. On unix/linux and unix-tools-ported-to-windows, there's the "od" command, and just running "od -txC data.file" would do nicely.)
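    If you don't have "od" handy, a rough Perl stand-in (just a sketch, not a polished tool) that dumps each line of the file as hex bytes would be:

        perl -ne 'print unpack("H*", $_), "\n"' output.txt

    Either way, the point is to look at the actual byte values rather than whatever your terminal decides to render.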

    Please, a little more information about what you are dealing with, and what you've done with my earlier suggestion.

    UPDATE: I just noticed that the strings returned by Jcode::getcode() might not work when passed to Encode::decode. You may need to add a hash that maps the Jcode strings to valid Encode designations:

    my %code_map = (
        euc  => 'euc-jp',
        sjis => 'shiftjis',
        jis  => 'iso-2022-jp',
        ucs2 => 'UCS-2LE',
        utf8 => 'utf8',
    );
    # ...
    my $kana_enc = getcode( $kana );
    # ...
    $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
    # ...
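    One more caveat: the documentation excerpt above also lists "ascii" and "binary" as possible return values, and neither has an entry in that hash, so decode() would be handed an undefined encoding name. A small guard along these lines (my own sketch, not required by Jcode) might save some head-scratching:

        my $kana_enc = getcode( $kana );
        my $kana_utf8;
        if ( $kana_enc eq 'ascii' ) {
            $kana_utf8 = $kana;                 # plain ascii needs no conversion
        }
        elsif ( exists $code_map{$kana_enc} ) {
            $kana_utf8 = decode( $code_map{$kana_enc}, $kana );
        }
        else {
            warn "getcode() reported '$kana_enc' -- not sure how to decode that\n";
            $kana_utf8 = $kana;                 # pass it through unchanged
        }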
      Thanks again for the help. Apologies for the previous obscure "...this doesn't work." comment.

      Just so you know the environment I am using:
      WinXP Professional Version 2002, Service Pack 2
      Komodo Professional 3.5.3
      perl, v5.8.8 built for MSWin32-x86-multi-thread

      There is a lot more info if I use "perl -V" in a Windows command prompt, but I'm not sure you want all that. Tell me if you do.

      So this time I used:

      binmode STDOUT, ":utf8";
      use DBI;
      use Encode;
      use Jcode;

      my $dbh = DBI->connect('DBI:ODBC:japan','','')
          or die "Cannot connect: $DBI::errstr\n";
      my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
          or die "Cannot prepare: $DBI::errstr\n";
      $sth->execute or die "Cannot execute: $DBI::errstr\n";

      my %code_map = (
          euc  => 'euc-jp',
          sjis => 'shiftjis',
          jis  => 'iso-2022-jp',
          ucs2 => 'UCS-2LE',
          utf8 => 'utf8',
      );

      my @row   = $sth->fetchrow_array;
      my $eng   = shift @row;   # first field is English
      my $kana  = shift @row;   # second field is Kana
      my $kanji = shift @row;   # third field is Kanji

      my $kana_enc  = getcode( $kana );
      my $kanji_enc = getcode( $kanji );
      if ( $kana_enc ne $kanji_enc ) {
          warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
      }

      my $kana_utf8  = decode( $kana_enc,  $kana );
      my $kanji_utf8 = decode( $kanji_enc, $kanji );
      printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

      I then changed to the directory where I have the program and ran it as:

      perl reply2.pl > output.txt

      Output:
      ------

      English: Ah!  Kana: ?  Kanji: NA

      So unfortunately there is still a question mark for anything in hiragana/kanji.

      I have Cygwin installed as well, so I used the 'od' command you suggested, like so:

      od -txC output.txt

      Output:
      -------
      0000000 45 6e 67 6c 69 73 68 3a 20 41 68 21 20 20 4b 61
      0000020 6e 61 3a 20 3f 20 20 4b 61 6e 6a 69 3a 20 4e 41
      0000040 0d 0a
      0000042

      I don't know what all these hex values mean. What do you think?

        Sorry for hijacking your thread, graff, but I think the problem lies in the inner workings of Jcode's getcode() function, which fails to identify UCS-2 under certain circumstances. An example:

        use Encode;   # for encode()
        use Jcode;    # for getcode()

        my $a = "\x{3042}";   # Hiragana 'a'
        show_info($a);        # UTF-8

        my $a_cp932 = encode("cp932", $a);
        show_info($a_cp932);

        my $a_ucs2le = encode("ucs2le", $a);
        show_info($a_ucs2le);

        my $a_ucs2be = encode("ucs2be", $a);
        show_info($a_ucs2be);

        sub show_info {
            my $s   = shift;
            my $hex = unpack("H*", $s);
            my $enc = getcode($s);
            print "hex = $hex\n";
            print "enc = $enc\n\n";
        }

        This prints (comments added)

        hex = e38182
        enc = utf8    # OK

        hex = 82a0
        enc = sjis    # OK

        hex = 4230
        enc = ascii   # wrong

        hex = 3042
        enc = ascii   # wrong

        As we can see, the latter two UCS-2 strings are incorrectly identified as "ascii"...

        Well, if you think about it, how should the function's heuristics tell apart the single-char UCS-2 strings from their regular two-char ASCII interpretations (i.e. "0B" == "\x30\x42" or "B0" == "\x42\x30")?

        Personally, I'd just look at the raw byte sequences. Sometimes, "less is more" ;)
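        For instance (just a sketch): take one raw Kana value exactly as DBI hands it over, dump its bytes, and try decoding it under both UCS-2 byte orders to see which one yields readable kana:

            use Encode qw(decode);

            # $kana: one raw Kana value fetched via DBI, as in the snippets above
            binmode STDOUT, ":utf8";
            printf "raw bytes : %s\n", unpack( "H*", $kana );
            printf "as UCS-2LE: %s\n", decode( "UCS-2LE", $kana );
            printf "as UCS-2BE: %s\n", decode( "UCS-2BE", $kana );

        Whichever line prints sensible Japanese tells you the byte order to hard-code in the decode() call; the other will come out as garbage.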