I've tried this out myself to define some character classes of interest for Arabic. It's not ready for distribution yet, because I still need to define a few more relevant classes, but I'll try to get it onto CPAN pretty soon. The general layout goes like this:
package ArChr;

=head1 NAME

ArChr -- useful character properties for Unicode Arabic

=head1 SYNOPSIS

  use ArChr;

  $c = "...";  # some UTF8 string

  $c =~ /\p{ArChr::InARletter}/;  # match only Arabic letters
  $c =~ /\p{ArChr::InARmark}/;    # match only Arabic diacritics

  # see description for full set of terms

=head1 DESCRIPTION

This module supplements the Unicode character-class definitions with
special groups relevant to Arabic linguistics. The following classes
are defined:

=over 4

=item InARletter

Matches only the Arabic letter characters, leaving out all digits and
diacritic and punctuation marks.

=item InARmark

Matches only the Arabic diacritic marks, leaving out all letters,
digits and punctuation marks.

=item InARvowel

Matches vowel letters and diacritics, leaving out consonants, shadda,
sukuun, and letters involving hamza.

=item InARshortvowel

Matches only the short-vowel diacritic marks, not sukuun or shadda.

=item InARcons

Matches consonant letters, hamzas and shadda, leaving out vowels and
sukuun.

=back

=cut

use strict;

sub InARletter
{
    return <<'END';
0621	063A
0641	064A
0671
067E
0686
0698
06AF
END
}

sub InARvowel
{
    return <<'END';
0627
064B	0650
END
}

sub InARcons
{
    return <<'END';
+ArChr::InARletter
-ArChr::InARvowel
END
}

sub InARmark
{
    return <<'END';
064B	0652
0670
END
}

sub InARshortvowel
{
    return <<'END';
064B	0650
END
}

1;
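For anyone who wants to poke at the idea before the module is released, here is a minimal self-contained sketch of how classes like these get used. The package name ArDemo and the helper count_matches are made up for illustration and are not part of ArChr; the range tables are copied from the listing above. I build the tables with an explicit "\t" so the separator between the two ends of a range stays visible, which is just a choice of mine and not something the module requires.

```perl
use strict;
use warnings;

# Hypothetical demo package (NOT the ArChr module itself); it borrows
# some of the range tables from the listing above.
package ArDemo;

# A user-defined property is a sub whose name starts with "In" or "Is".
# It returns lines holding either one hex code point or "start<TAB>end".
sub InARletter {
    return join '', map { "$_\n" }
        "0621\t063A", "0641\t064A", qw(0671 067E 0686 0698 06AF);
}

sub InARmark {
    return "064B\t0652\n0670\n";
}

sub InARvowel {
    return "0627\n064B\t0650\n";
}

# "+" includes another property, "-" removes one; both need the
# fully qualified name of the user-defined property.
sub InARcons {
    return "+ArDemo::InARletter\n-ArDemo::InARvowel\n";
}

package main;

# Hypothetical helper: count the characters in $str that match $re.
sub count_matches {
    my ($str, $re) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

# "kataba", fully vowelled: kaf, fatha, teh, fatha, beh, fatha.
my $word = "\x{0643}\x{064E}\x{062A}\x{064E}\x{0628}\x{064E}";

my $letters = count_matches($word, qr/\p{ArDemo::InARletter}/);
my $marks   = count_matches($word, qr/\p{ArDemo::InARmark}/);

# Strip the diacritics to leave the bare letter skeleton.
(my $bare = $word) =~ s/\p{ArDemo::InARmark}//g;

# Alef followed by lam: alef is removed by "-InARvowel", lam stays.
my $cons = count_matches("\x{0627}\x{0644}", qr/\p{ArDemo::InARcons}/);

print "letters=$letters marks=$marks bare_length=", length($bare),
      " cons=$cons\n";
```

Outside the defining package the property has to be written with its package name, as in \p{ArDemo::InARletter}; that is the same reason the SYNOPSIS above spells out \p{ArChr::InARletter}.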
In reply to Re: Creating new character classes for foreign languages by graff, in thread Creating new character classes for foreign languages by Polyglot