in reply to Japanese: detect hiragana/katakana/full width eisuuji

As per perlunicode,
\p{Hiragana} will match a hiragana character when used in a regexp.
\p{Katakana} will match a katakana character when used in a regexp.
\p{Han} will match a kanji character when used in a regexp.
You can also negate those with \P{...} (capital P) or \p{^...}. See perlunicode for details.
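
As a rough illustration of negation, the sketch below matches runs of characters that are not hiragana, then runs that are neither hiragana nor katakana. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

# Sample string (illustrative only): two hiragana, "abc", one katakana.
my $text = "\x{3072}\x{3089}abc\x{30AB}";

# \P{...} (capital P) matches any character NOT having the property.
my @non_hiragana = $text =~ /\P{Hiragana}+/g;    # one run: "abc" plus the katakana

# \p{^...} is an equivalent spelling of the same negation.
my @same = $text =~ /\p{^Hiragana}+/g;

# Inside a bracketed class, a leading ^ negates the whole class.
my @non_kana = $text =~ /[^\p{Hiragana}\p{Katakana}]+/g;    # one run: "abc"

print "non-kana run: $non_kana[0]\n";
```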

Note that the text must be decoded first: use an :encoding() layer (via open, binmode or use open), call utf8::decode() or Encode::decode(), or add use utf8; for literals in the source.
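
To see why decoding matters, here is a minimal sketch that assumes the input arrives as raw UTF-8 bytes. The byte string is a made-up sample. The same property matches nothing before Encode::decode() and matches both characters afterwards.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for U+3072 U+3089 (two hiragana), as they might come
# from a file or socket opened without an :encoding() layer.
my $bytes = "\xE3\x81\xB2\xE3\x82\x89";

# Undecoded, Perl sees six separate bytes, none of which is hiragana.
my $before = () = $bytes =~ /\p{Hiragana}/g;

# Decoded, the string holds two characters, both hiragana.
my $text  = decode('UTF-8', $bytes);
my $after = () = $text =~ /\p{Hiragana}/g;

printf "before decode: %d, after decode: %d\n", $before, $after;
```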

use strict;
use warnings;
use open ':std', ':locale';

$_ = <<"__EOI__";
\x{6F22}\x{5B57}
\x{3072}\x{3089}\x{304C}\x{306A}
\x{30AB}\x{30BF}\x{30AB}\x{30CA}
__EOI__

my $hiragana = join ' ', /\p{Hiragana}+/g;
my $katakana = join ' ', /\p{Katakana}+/g;
my $kanji    = join ' ', /\p{Han}+/g;

print("hiragana: $hiragana\n");
print("katakana: $katakana\n");
print("kanji:    $kanji\n");

hiragana: ひらがな
katakana: カタカナ
kanji:    漢字

Re^2: Japanese: detect hiragana/katakana/full width eisuuji
by GaijinPunch (Pilgrim) on Feb 01, 2009 at 06:10 UTC
    Thanks, that helps. Although the full width roman characters are the ones giving me the unusable results. :|

      http://unicode.org/charts/PDF/UFF00.pdf

      To detect:

      [\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}]  Fullwidth ASCII variants, brackets and symbols
      [\x{FF01}-\x{FF5E}]                   Fullwidth ASCII variants
      [\x{FF21}-\x{FF3A}]                   Fullwidth ASCII uppercase letters
      [\x{FF41}-\x{FF5A}]                   Fullwidth ASCII lowercase letters
      [\x{FF10}-\x{FF19}]                   Fullwidth ASCII digits
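
The ranges listed above drop straight into a regexp. The following is only a sketch: fullwidth_kinds is a hypothetical helper name, and it assumes the string has already been decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: report which fullwidth categories occur in a
# (already decoded) string, using the ranges listed above.
sub fullwidth_kinds {
    my ($text) = @_;
    my @kinds;
    push @kinds, 'uppercase' if $text =~ /[\x{FF21}-\x{FF3A}]/;
    push @kinds, 'lowercase' if $text =~ /[\x{FF41}-\x{FF5A}]/;
    push @kinds, 'digit'     if $text =~ /[\x{FF10}-\x{FF19}]/;
    return @kinds;
}

# Fullwidth "A", "b" and "1" (U+FF21, U+FF42, U+FF11).
my @kinds = fullwidth_kinds("\x{FF21}\x{FF42}\x{FF11}");
print join(',', @kinds), "\n";    # uppercase,lowercase,digit
```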

      To convert:

      my %fullwidth_to_narrow = map chr, (
          ( map { $_ => $_-0xFF01+0x21 } 0xFF01..0xFF5E ),
          0xFF5F => 0x2985,
          0xFF60 => 0x2986,
          0xFFE0 => 0x00A2,
          0xFFE1 => 0x00A3,
          0xFFE2 => 0x00AC,
          0xFFE3 => 0x00AF,
          0xFFE4 => 0x00A6,
          0xFFE5 => 0x00A5,
          0xFFE6 => 0x20A9,
      );
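
One way to apply the table is a single substitution over the detection class. This is a sketch rather than the only approach. The mapping is repeated so the snippet runs on its own, and the sample string is invented for the example.

```perl
use strict;
use warnings;

# Same mapping as above, repeated so this snippet is self-contained.
my %fullwidth_to_narrow = map chr, (
    ( map { $_ => $_ - 0xFF01 + 0x21 } 0xFF01 .. 0xFF5E ),
    0xFF5F => 0x2985, 0xFF60 => 0x2986,
    0xFFE0 => 0x00A2, 0xFFE1 => 0x00A3, 0xFFE2 => 0x00AC,
    0xFFE3 => 0x00AF, 0xFFE4 => 0x00A6, 0xFFE5 => 0x00A5,
    0xFFE6 => 0x20A9,
);

# Fullwidth "Perl5" (illustrative input, assumed already decoded).
my $text = "\x{FF30}\x{FF45}\x{FF52}\x{FF4C}\x{FF15}";

# Copy the string, then replace each fullwidth character by its narrow form.
(my $narrow = $text) =~
    s/([\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}])/$fullwidth_to_narrow{$1}/g;

print "$narrow\n";    # Perl5
```

Characters outside the two ranges are left untouched, so kana and kanji in the same string pass through unchanged.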