Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
I just signed up on PAUSE, and am finally ready to submit my first modules. Everyone recommends that newbies get advice on how to do this, especially as it pertains to the naming of the modules, so I present the matter here for your inspection. Having tried to research the matter, I'm a little conflicted about which category these modules would best fit, so your advice is much appreciated.
In short, these are very basic modules for handling the Thai and Lao character sets. In Thai and in Lao, each character/codepoint can have one or more categorizations, like vowel/consonant and uppercase/lowercase in English, but more complex. Perl's built-in Unicode support currently allows only the /\p{InThai}/ method of identification, so my module adds further regexp property tokens, such as:
This module should be useful in any textual manipulation, such as word/syllable identification or splitting (in Thai and Lao, words are not normally separated by whitespace as they are in English). It is a very simple module, whose features may be amended or augmented in the future, but whose present utility is readily apparent.
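To make the intended use concrete, here is a minimal sketch of how user-defined properties of this kind behave in a regexp. It assumes nothing beyond core Perl: the subs `InDemoThaiCons` and `InDemoThaiTone` are hypothetical stand-ins defined inline (they are not the module's own names), so the snippet runs without the module installed. Each sub returns hexadecimal code points, one entry per line, where two numbers separated by whitespace denote a range.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for two of the module's properties, defined inline
# for illustration. Perl requires the names to begin with "In" or "Is".
sub InDemoThaiCons { return "0E01\t0E2E\n" }    # range U+0E01..U+0E2E
sub InDemoThaiTone { return "0E48\t0E4B\n" }    # range U+0E48..U+0E4B

# Thai "water": no-nu (U+0E19), mai-tho (U+0E49), sara-am (U+0E33)
my $word = "\x{0E19}\x{0E49}\x{0E33}";

# Count the matches of each property in the string.
my $consonants = () = $word =~ /\p{InDemoThaiCons}/g;
my $tones      = () = $word =~ /\p{InDemoThaiTone}/g;

print "consonants: $consonants, tone marks: $tones\n";
```

The same pattern extends to splitting or classifying longer strings, since the properties can be combined freely inside character classes, for example `[\p{InDemoThaiCons}\p{InDemoThaiTone}]`.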
Now for the code. I'll present the Thai module; the Lao one is nearly identical, but covers the Lao character set.
package Regexp::Thai::CharClasses;

use 5.008003;
use strict;
use warnings;

require Exporter;

our $VERSION = '1.01';
our @ISA     = qw(Exporter);
our @EXPORT  = qw(
    InThai
    InThaiCons InThaiHCons InThaiMCons InThaiLCons
    InThaiVowel InThaiPreVowel InThaiPostVowel InThaiCompVowel
    InThaiAlpha InThaiDigit InThaiTone InThaiPunct
);

=head1 NAME

Regexp::Thai::CharClasses - useful character properties for Unicode Thai

=head1 SYNOPSIS

    use Regexp::Thai::CharClasses;

    $char = "...";                # some UTF8 string
    $char =~ /\p{InThaiCons}/;    # match a Thai consonant
    $char =~ /\p{InThaiTone}/;    # match a Thai tone mark

    # see description for full set of terms

=head1 DESCRIPTION

This module supplements the Unicode character-class definitions with
special groups relevant to Thai linguistics.

The following classes are defined:

=over 4

=item InThai

Matches ALL characters in the Thai Unicode code-point range.

=item InThaiCons

Matches Thai consonant letters, leaving out vowels, numerics, tone marks,
etc.

=item InThaiVowel

Matches Thai vowels, including compounded and free-standing vowels.

NOTE: Exceptions here include several of the "consonants" which also serve
as vowels: or-ang, yo-yak, double ro-reua, leut and reut, and wo-wen.
These are included as vowels in this grouping to accept the widest possible
definition, but cannot with certainty be determined by this to be in use
as actual vowels in the instance of their identification here.

=item InThaiAlpha

Matches only the Thai alphabetic characters (consonants and vowels),
excluding all digits, tone marks, and punctuation marks.

=item InThaiTone

Matches only the Thai tone marks, leaving out all letters, digits and
punctuation marks.

=item InThaiPunct

Matches Thai punctuation characters, not including tone marks, white
space, digits or alphabetic characters, and not including non-Thai
punctuation marks (such as English [.,'"!?] etc.).

=item InThaiCompVowel

Matches only the Thai vowels which are compounded with a Thai consonant,
and matching only the vowel portion of the compounded character.

=item InThaiPreVowel

Matches only the subset of vowels which appear _before_ the consonant
with which they are associated (though in Thai they are sounded _after_
said consonant); this excludes all consonant-vowels and does not include
any of the compounded vowels.

=item InThaiPostVowel

Matches only the vowels which appear _after_ the consonant with which
they are associated; this excludes all consonant-vowels and does not
include any of the compounded vowels.

=item InThaiHCons

Matches high-class Thai consonants.

=item InThaiMCons

Matches middle-class Thai consonants.

=item InThaiLCons

Matches low-class Thai consonants.

=item InThaiDigit

Matches Thai numerical digits only.

=back

=cut

# Each sub returns one entry per line: a single hex code point, or two
# hex code points separated by whitespace to denote an inclusive range.

sub InThai {
    return <<'END';
0E01 0E5B
END
}

sub InThaiCons {
    return <<'END';
0E01 0E2E
END
}

sub InThaiVowel {
    return join "\n",
        '0E30 0E45',
        '0E47',    # Thai semi-tone mark used above gor-gai in Thai "gor" (or)
        '0E4D',
        '0E22',    # Thai consonant yo-yak can also be a vowel (like 'y' in English)
        '0E2D',    # Thai consonant or-ang can also be a vowel
        '0E27';    # Thai consonant wo-wen is only a vowel following mai han-akat
}

sub InThaiAlpha {
    return <<'END';
0E01 0E2E
0E30 0E45
0E47
0E4D
0E22
0E2D
0E27
END
}

sub InThaiTone {
    return <<'END';
0E48 0E4B
END
}

sub InThaiPunct {
    return <<'END';
0E46
0E4C
0E4E
0E4F
0E5A
0E5B
END
}

sub InThaiCompVowel {
    return join "\n",
        '0E31',    # Thai mai han-akat
        '0E34',    # Thai sara-i
        '0E35',    # Thai sara-ii
        '0E36',    # Thai sara-ue
        '0E37',    # Thai sara-uee
        '0E38',    # Thai sara-u
        '0E39',    # Thai sara-uu
        '0E3A',    # Thai phinthu
        '0E47';    # Thai semi-tone mark used above gor-gai in Thai "gor" (or)
}

sub InThaiPreVowel {
    return <<'END';
0E40 0E44
END
}

sub InThaiPostVowel {
    return <<'END';
0E45
0E30
0E32
0E33
END
}

sub InThaiHCons {
    return <<'END';
0E02
0E03
0E09
0E10
0E16
0E1C
0E1D
0E28
0E29
0E2A
0E2B
END
}

sub InThaiMCons {
    return <<'END';
0E01
0E08
0E0E
0E0F
0E14
0E15
0E1A
0E1B
0E2D
END
}

# 0E2C and 0E2E are listed singly so that or-ang (0E2D), a middle-class
# consonant, is not swept into the low-class set.
sub InThaiLCons {
    return <<'END';
0E04 0E07
0E0A 0E0D
0E11 0E13
0E17 0E19
0E1E 0E27
0E2C
0E2E
END
}

sub InThaiDigit {
    return <<'END';
0E50 0E59
END
}

=head1 AUTHOR

Erik Mundall

=head1 COPYRIGHT

Copyright (C) 2015 Erik Mundall. All Rights Reserved.

This is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut

1;
For names, I've considered Lingua and some others, but this is so directly Regexp-related that I think it might better live there. I'm fully open to suggestions. As an entirely self-taught coder who is only a hobbyist at it, and a teacher by trade, I'm also open to corrections on the code itself. Regarding Exporter, I know that exporting every function by default is discouraged, but I just cannot see the need to separate these out: how often would someone want only the vowels and, if so, how much would be gained by importing only those? The added complexity, weighed against namespace pollution, seems to me a net disadvantage, considering that the names here are already very specific and unlikely to clash with anything. Yet I will readily listen to those of greater experience.
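For comparison, the opt-in alternative could be sketched roughly as follows. This is only an illustration under stated assumptions: the package name `Demo::ThaiClasses` and the `:all` tag are hypothetical, and only two of the properties are reproduced. The package is declared inline inside `BEGIN` blocks so the snippet runs as a single file; in a real program the import would simply be `use Demo::ThaiClasses qw(InThaiTone);`. The sketch also shows that a property which has not been imported can still be reached by qualifying it with its package name inside `\p{...}`.

```perl
use strict;
use warnings;

BEGIN {
    package Demo::ThaiClasses;    # hypothetical package name, for illustration
    use Exporter 'import';

    # Nothing is exported by default; callers ask for what they need.
    our @EXPORT_OK   = qw(InThaiTone InThaiDigit);
    our %EXPORT_TAGS = ( all => [@EXPORT_OK] );    # hypothetical ":all" tag

    sub InThaiTone  { return "0E48\t0E4B\n" }
    sub InThaiDigit { return "0E50\t0E59\n" }
}

# Stand-in for: use Demo::ThaiClasses qw(InThaiTone);
BEGIN { Demo::ThaiClasses->import('InThaiTone') }

# mai-tho (U+0E49) followed by Thai digit one (U+0E51)
my $text = "\x{0E49}\x{0E51}";

# Imported property: usable by its bare name in this package.
my $has_tone = $text =~ /\p{InThaiTone}/ ? 1 : 0;

# Property that was not imported: reachable via its package-qualified name.
my $has_digit = $text =~ /\p{Demo::ThaiClasses::InThaiDigit}/ ? 1 : 0;

print "tone: $has_tone, digit: $has_digit\n";
```

With a tag such as `:all`, a caller who wants everything still writes a single short import line, so the extra cost of opt-in exporting is small.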
LATEST UPDATE: Suggested names so far have included:
At this point, I've updated the name of the package above to reflect what I am most strongly leaning toward, a slight modification of the suggestions presented in the list above: Regexp::Thai::CharClasses. The floor is still open for suggestions.
Thank you for your help.
Blessings,
~Polyglot~