Japanese: detect hiragana/katakana/full-width eisuuji

GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

Monks

I'll make this short & sweet. I'm using Jcode and Kakasi on a rather sizeable list of strings. Kakasi behaves differently depending on whether you pass it hiragana or kanji. Is there any module that will tell you what kind of character a given double-byte character is? For this particular script, everything ends up as UTF-8. I guess the non-lazy way would be to examine both bytes against a table. But hey, it's Sunday. What can I say?

I know EUC characters fall into nice ranges (hiragana, katakana, and full-width eisuuji coming first, if memory serves). Guess I should dig around and look a bit closer at UTF-8.


Re: Japanese: detect hiragana/katakana/full-width eisuuji
by ikegami (Patriarch) on Feb 01, 2009 at 05:49 UTC

As per perlunicode,
\p{Hiragana} will match a hiragana character when used in a regexp.
\p{Katakana} will match a katakana character when used in a regexp.
\p{Han} will match a kanji character when used in a regexp.
You can also negate those. See the referenced document.

Note that the text must have been decoded first: by using :encoding() on open, binmode or use open; by calling utf8::decode() or Encode::decode(); or, for literals, with use utf8;.
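
To illustrate the decoding point, here is a minimal sketch, assuming the input arrives as raw UTF-8 bytes. The variable names are made up for the example; Encode::decode() is the only library call used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for U+3072 (hiragana HI), as they might arrive
# from a file or socket that was opened without an encoding layer.
my $bytes = "\xE3\x81\xB2";

# Before decoding, Perl sees three separate one-byte characters,
# none of which is hiragana.
my $matches_raw = $bytes =~ /\p{Hiragana}/ ? 1 : 0;

# After decoding, the string holds a single character, U+3072.
my $text = decode('UTF-8', $bytes);
my $matches_decoded = $text =~ /\p{Hiragana}/ ? 1 : 0;

print "raw: $matches_raw, decoded: $matches_decoded\n";
```

The same regexp fails on the bytes and succeeds on the decoded text, which is why the decoding step has to come first.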

use strict;
use warnings;

use open ':std', ':locale';

$_ = <<"__EOI__";
\x{6F22}\x{5B57}
\x{3072}\x{3089}\x{304C}\x{306A}
\x{30AB}\x{30BF}\x{30AB}\x{30CA}
__EOI__

my $hiragana = join ' ', /\p{Hiragana}+/g;
my $katakana = join ' ', /\p{Katakana}+/g;
my $kanji    = join ' ', /\p{Han}+/g;

print("hiragana: $hiragana\n");
print("katakana: $katakana\n");
print("kanji:    $kanji\n");

hiragana: ひらがな
katakana: カタカナ
kanji:    漢字
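
If you need to know what one particular character is, rather than pull runs out of a string, the same properties can be tested one character at a time. A rough sketch, where classify_char is a hypothetical helper name and the input is assumed to be decoded already:

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: name the script of a single decoded character.
sub classify_char {
    my ($ch) = @_;
    return 'hiragana' if $ch =~ /\A\p{Hiragana}\z/;
    return 'katakana' if $ch =~ /\A\p{Katakana}\z/;
    return 'kanji'    if $ch =~ /\A\p{Han}\z/;
    return 'other';
}

# One kanji, one hiragana, one katakana and one ASCII letter.
for my $ch (split //, "\x{6F22}\x{3072}\x{30AB}a") {
    printf "U+%04X %s %s\n", ord($ch), $ch, classify_char($ch);
}
```

Anything that is not kana or kanji falls through to 'other', so full-width Roman characters would need their own test.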


Re^2: Japanese: detect hiragana/katakana/full-width eisuuji

by GaijinPunch (Pilgrim) on Feb 01, 2009 at 06:10 UTC

Thanks, that helps. It's the full-width Roman characters that are giving me unusable results, though. :|


Re^3: Japanese: detect hiragana/katakana/full-width eisuuji

by ikegami (Patriarch) on Feb 01, 2009 at 06:29 UTC

http://unicode.org/charts/PDF/UFF00.pdf

To detect:

`[\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}]`	Full-width ASCII variants, brackets and symbols
`[\x{FF01}-\x{FF5E}]`	Full-width ASCII variants
`[\x{FF21}-\x{FF3A}]`	Full-width ASCII uppercase letters
`[\x{FF41}-\x{FF5A}]`	Full-width ASCII lowercase letters
`[\x{FF10}-\x{FF19}]`	Full-width ASCII digits
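
A small sketch of how those classes might be used on a decoded string. The helper names has_fullwidth_eisuuji and count_fullwidth_ascii are invented for this example.

```perl
use strict;
use warnings;

# Character classes built from the ranges listed above.
my $FW_ASCII  = qr/[\x{FF01}-\x{FF5E}]/;
my $FW_DIGIT  = qr/[\x{FF10}-\x{FF19}]/;
my $FW_LETTER = qr/[\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/;

# Hypothetical helper: true if the decoded string contains any
# full-width letter or digit (eisuuji).
sub has_fullwidth_eisuuji {
    my ($str) = @_;
    return $str =~ /$FW_DIGIT|$FW_LETTER/ ? 1 : 0;
}

# Hypothetical helper: count the full-width ASCII variants in a string.
sub count_fullwidth_ascii {
    my ($str) = @_;
    my $count = () = $str =~ /$FW_ASCII/g;
    return $count;
}

print has_fullwidth_eisuuji("abc\x{FF21}"), "\n";   # full-width A present
print count_fullwidth_ascii("\x{FF21}\x{FF22}c\x{FF01}"), "\n";
```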

To convert:

# Keys and values are characters, not code points: chr is applied to every element.
my %fullwidth_to_narrow = map chr, (
   ( map { $_ => $_-0xFF01+0x21 } 0xFF01..0xFF5E ),
   0xFF5F => 0x2985,
   0xFF60 => 0x2986,
   0xFFE0 => 0x00A2,
   0xFFE1 => 0x00A3,
   0xFFE2 => 0x00AC,
   0xFFE3 => 0x00AF,
   0xFFE4 => 0x00A6,
   0xFFE5 => 0x00A5,
   0xFFE6 => 0x20A9,
);
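
One way to apply such a table is a single substitution over the decoded string. This sketch rebuilds only the ASCII-variant part of the table so that it runs on its own; to_narrow is a hypothetical helper name.

```perl
use strict;
use warnings;

# Same construction as the table above, restricted to the ASCII variants.
my %fullwidth_to_narrow = map chr, (
   map { $_ => $_ - 0xFF01 + 0x21 } 0xFF01 .. 0xFF5E
);

# Hypothetical helper: replace every full-width ASCII variant in a decoded
# string with its narrow counterpart; other characters pass through unchanged.
sub to_narrow {
    my ($str) = @_;
    $str =~ s/([\x{FF01}-\x{FF5E}])/$fullwidth_to_narrow{$1}/g;
    return $str;
}

# Full-width "ABC123" followed by a hiragana character.
my $narrow = to_narrow("\x{FF21}\x{FF22}\x{FF23}\x{FF11}\x{FF12}\x{FF13}\x{3072}");
print length($narrow), "\n";
```

To cover the brackets and symbols as well, the character class in the substitution would have to be widened to match the keys of the full table.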
