GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

Monks

I'll make this short & sweet. I'm using Jcode and Kakasi on a rather sizeable list of strings. Kakasi behaves differently depending on whether you pass it hiragana or kanji. Is there any module that will tell you what kind of character a given double-byte character is? For this particular script, everything ends up as UTF-8. I guess the non-lazy way would be to examine both bytes against a table. But hey, it's Sunday. What can I say?

I know EUC characters fall in a nice range (hiragana, katakana, and full-width eisuuji (alphanumerics) coming first, if memory serves). Guess I should dig around and look a bit closer at UTF-8.

Re: Japanese: detect hiragana/katakana/full width eisuuji
by ikegami (Patriarch) on Feb 01, 2009 at 05:49 UTC

    As per perlunicode,
    \p{Hiragana} will match a hiragana character when used in a regexp.
    \p{Katakana} will match a katakana character when used in a regexp.
    \p{Han} will match a kanji character when used in a regexp.
    You can also negate those. See the referenced document.

    Note that the text must have been decoded first: with :encoding() on open, binmode or use open; with utf8::decode() or Encode::decode(); or with use utf8; for literals in the source.
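To illustrate why the decoding step matters, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes (for example, read from a file without an :encoding() layer); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for the two hiragana U+3072 U+3089 (assumed input).
my $bytes = "\xE3\x81\xB2\xE3\x82\x89";

# Before decoding, Perl sees six separate bytes, none of which is hiragana.
my $matched_raw = $bytes =~ /\p{Hiragana}/ ? 1 : 0;

# After decoding, the string holds two characters and the property matches.
my $text = decode('UTF-8', $bytes);
my $matched_decoded = $text =~ /\p{Hiragana}/ ? 1 : 0;

print "raw: $matched_raw, decoded: $matched_decoded, length: ", length($text), "\n";
```

With the undecoded bytes the regexp silently fails to match, which is easy to mistake for "the property doesn't work".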

    use strict;
    use warnings;
    use open ':std', ':locale';

    $_ = <<"__EOI__";
    \x{6F22}\x{5B57}
    \x{3072}\x{3089}\x{304C}\x{306A}
    \x{30AB}\x{30BF}\x{30AB}\x{30CA}
    __EOI__

    my $hiragana = join ' ', /\p{Hiragana}+/g;
    my $katakana = join ' ', /\p{Katakana}+/g;
    my $kanji    = join ' ', /\p{Han}+/g;

    print("hiragana: $hiragana\n");
    print("katakana: $katakana\n");
    print("kanji:    $kanji\n");

    hiragana: ひらがな
    katakana: カタカナ
    kanji:    漢字
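Since the original question was about telling what kind of character a string contains before handing it to Kakasi, the same properties can be wrapped in a small classifier. This is only a sketch: char_class and classes_in are hypothetical helper names, and the input is assumed to be already decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one decoded character by Unicode script.
sub char_class {
    my ($ch) = @_;
    return 'hiragana' if $ch =~ /\A\p{Hiragana}\z/;
    return 'katakana' if $ch =~ /\A\p{Katakana}\z/;
    return 'kanji'    if $ch =~ /\A\p{Han}\z/;
    return 'other';
}

# Hypothetical helper: list the distinct classes found in a string,
# in order of first appearance.
sub classes_in {
    my ($text) = @_;
    my (%seen, @classes);
    for my $ch (split //, $text) {
        my $class = char_class($ch);
        push @classes, $class unless $seen{$class}++;
    }
    return @classes;
}

# U+6F22 U+5B57 (kanji) followed by U+3072 U+3089 (hiragana).
my @found = classes_in("\x{6F22}\x{5B57}\x{3072}\x{3089}");
print join(', ', @found), "\n";   # kanji, hiragana
```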
    
      Thanks, that helps. Although the full width roman characters are the ones giving me the unusable results. :|

        http://unicode.org/charts/PDF/UFF00.pdf

        To detect:

        [\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}]   Fullwidth ASCII variants, brackets and symbols
        [\x{FF01}-\x{FF5E}]                    Fullwidth ASCII variants
        [\x{FF21}-\x{FF3A}]                    Fullwidth ASCII uppercase letters
        [\x{FF41}-\x{FF5A}]                    Fullwidth ASCII lowercase letters
        [\x{FF10}-\x{FF19}]                    Fullwidth ASCII digits
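As a rough sketch of how these classes might be used, the snippet below checks a decoded string for fullwidth characters and counts the fullwidth digits. The variable names and the sample string are made up for the example.

```perl
use strict;
use warnings;

# Character classes taken from the list above.
my $fullwidth_any   = qr/[\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}]/;
my $fullwidth_digit = qr/[\x{FF10}-\x{FF19}]/;

# "ABC123" written with fullwidth letters and digits (decoded characters).
my $sample = "\x{FF21}\x{FF22}\x{FF23}\x{FF11}\x{FF12}\x{FF13}";

my $has_fullwidth = $sample =~ $fullwidth_any ? 1 : 0;
my $digit_count   = () = $sample =~ /$fullwidth_digit/g;   # count the matches
my $plain_has     = "ABC123" =~ $fullwidth_any ? 1 : 0;

print "has fullwidth: $has_fullwidth, digits: $digit_count, plain ASCII: $plain_has\n";
```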

        To convert:

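        # Map each fullwidth code point to its narrow counterpart, then turn
        # every code point (keys and values alike) into a character with chr.
        my %fullwidth_to_narrow = map chr, (
            ( map { $_ => $_-0xFF01+0x21 } 0xFF01..0xFF5E ),  # fullwidth ASCII variants
            0xFF5F => 0x2985,  # left white parenthesis
            0xFF60 => 0x2986,  # right white parenthesis
            0xFFE0 => 0x00A2,  # cent sign
            0xFFE1 => 0x00A3,  # pound sign
            0xFFE2 => 0x00AC,  # not sign
            0xFFE3 => 0x00AF,  # macron
            0xFFE4 => 0x00A6,  # broken bar
            0xFFE5 => 0x00A5,  # yen sign
            0xFFE6 => 0x20A9,  # won sign
        );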
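The table by itself does not convert anything. One way to apply it, sketched here with a hypothetical to_narrow helper, is a substitution over the detection class listed earlier. The table is rebuilt so the snippet runs on its own, and the input is assumed to be decoded text.

```perl
use strict;
use warnings;

# Same table as above, rebuilt so the sketch is self-contained.
my %fullwidth_to_narrow = map chr, (
    ( map { ($_ => $_ - 0xFF01 + 0x21) } 0xFF01 .. 0xFF5E ),
    0xFF5F => 0x2985, 0xFF60 => 0x2986,
    0xFFE0 => 0x00A2, 0xFFE1 => 0x00A3, 0xFFE2 => 0x00AC,
    0xFFE3 => 0x00AF, 0xFFE4 => 0x00A6, 0xFFE5 => 0x00A5,
    0xFFE6 => 0x20A9,
);

# Hypothetical helper: replace every fullwidth character with its narrow form.
sub to_narrow {
    my ($text) = @_;
    $text =~ s/([\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}])/$fullwidth_to_narrow{$1}/g;
    return $text;
}

# Fullwidth "ABC123" becomes plain ASCII.
my $narrow = to_narrow("\x{FF21}\x{FF22}\x{FF23}\x{FF11}\x{FF12}\x{FF13}");
print "$narrow\n";   # ABC123
```

Characters outside the class are left untouched, so hiragana, katakana and kanji pass through unchanged.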