Re^5: MS Access Input -> Japanese Output

Sorry for hijacking your thread, graff, but I think the problem lies in the inner workings of Jcode's getcode() function, which fails to identify UCS-2 under certain circumstances. An example:

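use strict;
use warnings;
use Encode;  # exports encode()
use Jcode;   # exports getcode()

my $a = "\x{3042}";  # Hiragana 'a'

# Encode explicitly, so that unpack("H*") sees the UTF-8 bytes on any Perl version
my $a_utf8   = encode("utf8",   $a);
show_info($a_utf8);

my $a_cp932  = encode("cp932",  $a);
show_info($a_cp932);

my $a_ucs2le = encode("ucs2le", $a);
show_info($a_ucs2le);

my $a_ucs2be = encode("ucs2be", $a);
show_info($a_ucs2be);

# Print the raw bytes in hex, followed by Jcode's guess at the encoding
sub show_info {
    my $s = shift;
    my $hex = unpack("H*", $s);
    my $enc = getcode($s);  # scalar context: encoding name only
    print "hex = $hex\n";
    print "enc = $enc\n\n";
}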

This prints (comments added):

hex = e38182
enc = utf8       # OK

hex = 82a0
enc = sjis       # OK

hex = 4230
enc = ascii      # wrong

hex = 3042
enc = ascii      # wrong

As we can see, the latter two UCS-2 strings are incorrectly identified as "ascii"...

Well, if you think about it, how should the function's heuristics tell apart the single-char UCS-2 strings from their regular two-char ASCII interpretations (i.e. "0B" == "\x30\x42" or "B0" == "\x42\x30")?
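To make the ambiguity concrete, here is a minimal sketch that uses only Encode (no Jcode) and compares the BOM-less UCS-2 encodings of U+3042 with the two-character ASCII strings mentioned above. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hiragana 'a' (U+3042) encoded as UCS-2 without a byte order mark
my $a_ucs2le = encode("UCS-2LE", "\x{3042}");  # bytes 42 30
my $a_ucs2be = encode("UCS-2BE", "\x{3042}");  # bytes 30 42

# Byte for byte, these are also valid two-character ASCII strings
print "UCS-2LE bytes are identical to ASCII 'B0'\n" if $a_ucs2le eq "B0";
print "UCS-2BE bytes are identical to ASCII '0B'\n" if $a_ucs2be eq "0B";
```

Both messages are printed, so a guesser that sees only these two bytes has nothing to tell it that UCS-2 was intended.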

Personally, I'd just look at the raw byte sequences. Sometimes, "less is more" ;)
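One way to do that, as a rough sketch: dump the bytes in hex and, once you know from the data source which encoding to expect, decode explicitly instead of guessing. The helper name hex_bytes() is made up for this example, and the choice of UCS-2LE is an assumption about the input, not something the bytes themselves can tell you.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render a byte string as space-separated hex pairs
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", unpack("(H2)*", $bytes);
}

my $bytes = "\x42\x30";                 # the two bytes that were reported as "ascii"
my $text  = decode("UCS-2LE", $bytes);  # assumption: the source is known to be UCS-2LE

print "bytes = ", hex_bytes($bytes), "\n";  # bytes = 42 30
printf "char  = U+%04X\n", ord($text);      # char  = U+3042
```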
