Re^4: MS Access Input -> Japanese Output

Thanks again for the help. Apologies for the previous obscure "...this doesn't work." comment.

Just so you know the environment I am using:
WinXP Professional Version 2002, Service Pack 2
Komodo Professional 3.5.3
perl, v5.8.8 built for MSWin32-x86-multi-thread

Running "perl -V" in a Windows command prompt gives a lot more info, but I'm not sure you want all that. Tell me if you do.

So this time I used:

binmode STDOUT, ":utf8";

use DBI;
use Encode;
use Jcode;

my $dbh = DBI->connect('DBI:ODBC:japan','','')
    or die "Cannot connect: $DBI::errstr\n";

my $sth = $dbh->prepare('Select English, Kana, Kanji from Vocab')
    or die "Cannot prepare: $DBI::errstr\n";
    
$sth->execute or die "Cannot execute: $DBI::errstr\n";

my %code_map = ( euc => 'euc-jp',
                 sjis => 'shiftjis',
                 jis => 'iso-2022-jp',
                 ucs2 => 'UCS-2LE',  utf8 => 'utf8' );

my @row = $sth->fetchrow_array;
my $eng = shift @row;   # first field is English
my $kana = shift @row;  # second field is Kana
my $kanji = shift @row; # third field is Kanji

my $kana_enc = getcode( $kana );
my $kanji_enc = getcode( $kanji );
if ( $kana_enc ne $kanji_enc ) {
    warn "Very strange: kana is in $kana_enc, but kanji is in $kanji_enc\n";
}
my $kana_utf8 = decode( $kana_enc, $kana );
my $kanji_utf8 = decode( $kanji_enc, $kanji );

printf( "English: %s  Kana: %s  Kanji: %s\n", $eng, $kana_utf8, $kanji_utf8 );

I then changed to the directory where I have the program and ran it as:

perl reply2.pl > output.txt

Output:
------

English: Ah!  Kana: ?  Kanji: NA

So unfortunately there is still a question mark for anything in hiragana/kanji.

I have cygwin installed as well so I used the 'od' command you suggested as per:

od -txC output.txt

Output:
-------
0000000 45 6e 67 6c 69 73 68 3a 20 41 68 21 20 20 4b 61
0000020 6e 61 3a 20 3f 20 20 4b 61 6e 6a 69 3a 20 4e 41
0000040 0d 0a
0000042

I don't know what all these hex values mean. What do you think?


Replies are listed 'Best First'.

Re^5: MS Access Input -> Japanese Output
by almut (Canon) on Nov 13, 2006 at 14:19 UTC

Sorry for hijacking your thread, graff, but I think the problem lies in the inner workings of Jcode's getcode() function, which fails to identify UCS-2 under certain circumstances. An example:

use Encode;   # encode()
use Jcode;    # getcode()

my $a = "\x{3042}";  # Hiragana 'a'
show_info($a);       # UTF-8

my $a_cp932  = encode("cp932",  $a);
show_info($a_cp932);

my $a_ucs2le = encode("ucs2le", $a);
show_info($a_ucs2le);

my $a_ucs2be = encode("ucs2be", $a);
show_info($a_ucs2be);

sub show_info {
    my $s = shift;
    my $hex = unpack("H*", $s);
    my $enc = getcode($s);
    print "hex = $hex\n";
    print "enc = $enc\n\n";
}

This prints (comments added)

hex = e38182
enc = utf8       # OK

hex = 82a0
enc = sjis       # OK

hex = 4230
enc = ascii      # wrong

hex = 3042
enc = ascii      # wrong

As we can see, the latter two UCS-2 strings are incorrectly identified as "ascii"...

Well, if you think about it, how should the function's heuristics tell apart the single-char UCS-2 strings from their regular two-char ASCII interpretations (i.e. "0B" == "\x30\x42" or "B0" == "\x42\x30")?
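A minimal sketch of that ambiguity, assuming only the core Encode module (the helper name codepoints is made up for illustration). It decodes the same two bytes under three different assumptions; each decode succeeds, so the bytes alone cannot say which reading was intended:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points of a character string.
sub codepoints {
    my ($str) = @_;
    return join " ", map { sprintf "U+%04X", ord($_) } split //, $str;
}

# The same two raw bytes, interpreted three ways.
my $bytes = "\x30\x42";

my $as_ascii  = decode("ascii",   $bytes);  # two characters, "0" and "B"
my $as_ucs2be = decode("UCS-2BE", $bytes);  # one character, Hiragana 'a'
my $as_ucs2le = decode("UCS-2LE", $bytes);  # one character, byte order swapped

print "ascii:   ", codepoints($as_ascii),  "\n";
print "UCS-2BE: ", codepoints($as_ucs2be), "\n";
print "UCS-2LE: ", codepoints($as_ucs2le), "\n";
```

All three readings are valid, which is why a guessing function has no reliable way to choose between them for such short strings.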

Personally, I'd just look at the raw byte sequences. Sometimes, "less is more" ;)
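One way to look at the raw bytes, as a rough sketch: the helper hex_dump is hypothetical, and the sample values below are stand-ins taken from the byte sequences shown earlier in this thread, not data read from the actual database.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join " ", unpack("(H2)*", $bytes);
}

# Stand-in values; in the real script these would be the fetched fields.
my @samples = (
    [ 'UTF-8 Hiragana a'   => "\xe3\x81\x82" ],
    [ 'cp932 Hiragana a'   => "\x82\xa0"     ],
    [ 'UCS-2LE Hiragana a' => "\x42\x30"     ],
    [ 'literal "?"'        => "\x3f"         ],
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%-20s %s\n", $label, hex_dump($bytes);
}
```

In the od dump above, the byte right after "Kana: " is 3f, a plain ASCII question mark. If hex_dump on the freshly fetched $kana also shows a lone 3f, that would suggest the character was already replaced before Perl received it, so no choice of decode() could bring it back.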
