in reply to Re^4: Namespace/advice for new CPAN modules for Thai & Lao ( Regexp::CharProps::Thai )
in thread Namespace/advice for new CPAN modules for Thai & Lao

So maybe

#!/usr/bin/perl --

=head1 NAME

Regexp::CharProps - User Defined Character Properties

=head1 SYNOPSIS

    use Regexp::CharProps qw/ Thai /;
    ## like use Regexp::CharProps::Thai qw/ :all /;
    ## imports all exports like sub InThaiPunct...

    print "\$_ has got Thai" if m{
        \p{InThai}
        |\p{InThaiCons}
        |\p{InThaiHCons}
        |\p{InThaiMCons}
        |\p{InThaiLCons}
        |\p{InThaiVowel}
        |\p{InThaiPreVowel}
        |\p{InThaiPostVowel}
        |\p{InThaiCompVowel}
        |\p{InThaiDigit}
        |\p{InThaiTone}
        |\p{InThaiPunct}
    }x;

    use Regexp::CharProps::Thai qw/ InThaiPunct /;
    ## not import :all just sub InThaiPunct
    print "got Thai punctuation\n" if m/\p{InThaiPunct}/;

=cut

package Regexp::CharProps;

sub import {
    my( $class, @modsforall ) = @_;
    return if not @modsforall;
    my $target = scalar caller;
    require Import::Into;
    for my $mod( @modsforall ){
        my $package = $class."::".$mod;
        $package->import::into( $target , ':all' );
    }
}

1;

with accompanying EXPORT_TAGS

package Regexp::CharProps::Thai;

use 5.008003;
use strict;
use warnings;

require Exporter;

our $VERSION = '1.01';

our @ISA = qw(Exporter);

our @EXPORT_OK = qw(
    InThai
    InThaiAlpha
    InThaiCons
    InThaiHCons
    InThaiMCons
    InThaiLCons
    InThaiVowel
    InThaiPreVowel
    InThaiPostVowel
    InThaiCompVowel
    InThaiDigit
    InThaiTone
    InThaiPunct
);

our %EXPORT_TAGS = ( 'all' => [ @EXPORT_OK ] );

=head1 NAME

Regexp::CharProps::Thai - useful character properties for Unicode Thai

=head1 SYNOPSIS

    use Regexp::CharProps::Thai qw( :all );
    $char = "...";                 # some UTF8 string
    $char =~ /\p{InThaiCons}/;     # match a Thai consonant
    $char =~ /\p{InThaiTone}/;     # match a Thai tone mark
    # see description for full set of terms

=head1 DESCRIPTION

This module supplements the Unicode character-class definitions with
special groups relevant to Thai linguistics. The following classes are
defined:

=over 4

=item InThai

Matches ALL characters in the Thai unicode code-point range.

=item InThaiCons

Matches Thai consonant letters, leaving out vowels, numerics, tone marks,
etc.

=item InThaiVowel

Matches Thai vowels, including compounded and free-standing vowels.
NOTE: Exceptions here include several of the "consonants" which also
serve as vowels: or-ang, yo-yak, double ro-reua, leut and reut, and
wo-wen. These are included as vowels in this grouping to accept the
widest possible definition, but cannot with certainty be determined by
this to be in use as actual vowels in the instance of their
identification here.

=item InThaiAlpha

Matches only the Thai alphabetic characters (consonants and vowels),
excluding all digits, tone marks, and punctuation marks.

=item InThaiTone

Matches only the Thai tone marks, leaving out all letters, digits and
punctuation marks.

=item InThaiPunct

Matches Thai punctuation characters, not including tone marks, white
space, digits or alphabetic characters, and not including non-Thai
punctuation marks (such as English [.,'"!?] etc.).

=item InThaiCompVowel

Matches only the Thai vowels which are compounded with a Thai consonant,
and matching only the vowel portion of the compounded character.

=item InThaiPreVowel

Matches only the subset of vowels which appear _before_ the consonant
with which they are associated (though in Thai they are sounded _after_
said consonant); this excludes all consonant-vowels and does not include
any of the compounded vowels.

=item InThaiPostVowel

Matches only the vowels which appear _after_ the consonant with which
they are associated; this excludes all consonant-vowels and does not
include any of the compounded vowels.

=item InThaiHCons

Matches high-class Thai consonants.

=item InThaiMCons

Matches middle-class Thai consonants.

=item InThaiLCons

Matches low-class Thai consonants.

=item InThaiDigit

Matches Thai numerical digits only.

=back

=cut

sub InThai {
    return <<'END';
0E01 0E5B
END
}

sub InThaiCons {
    return <<'END';
0E01 0E2E
END
}

sub InThaiVowel {
    return join "\n",
        '0E30 0E45',
        '0E47',    #Thai semi-tone mark used above gor-gai in Thai "gor" (or)
        '0E4D',
        '0E22',    #Thai consonant yo-yak can also be a vowel (like 'y' in English)
        '0E2D',    #Thai consonant or-ang can also be a vowel
        '0E27',    #Thai consonant wo-wen is only a vowel following mai han-akat
}

#+Thai::InThaiCons
#+Thai::InThaiVowel
sub InThaiAlpha {
    return <<'END';
0E01 0E2E
0E30 0E45
0E47
0E4D
0E22
0E2D
0E27
END
}

sub InThaiTone {
    return <<'END';
0E48 0E4B
END
}

sub InThaiPunct {
    return <<'END';
0E46
0E4C
0E4E
0E4F
0E5A
0E5B
END
}

sub InThaiCompVowel {
    return join "\n",
        '0E31',    #Thai mai han-akat
        '0E34',    #Thai sara-i
        '0E35',    #Thai sara-ii
        '0E36',    #Thai sara-ue
        '0E37',    #Thai sara-uee
        '0E38',    #Thai sara-u
        '0E39',    #Thai sara-uu
        '0E3A',    #Thai phinthu
        '0E47',    #Thai semi-tone mark used above gor-gai in Thai "gor" (or)
}

sub InThaiPreVowel {
    return <<'END';
0E40 0E44
END
}

sub InThaiPostVowel {
    return <<'END';
0E45
0E30
0E32
0E33
END
}

sub InThaiHCons {
    return <<'END';
0E02
0E03
0E09
0E10
0E16
0E1C
0E1D
0E28
0E29
0E2A
0E2B
END
}

sub InThaiMCons {
    return <<'END';
0E01
0E08
0E0E
0E0F
0E14
0E15
0E1A
0E1B
0E2D
END
}

#+Thai::InThaiCons
#-Thai::InThaiHcons
#-Thai::InThaiMCons
sub InThaiLCons {
    return <<'END';
0E04 0E07
0E0A 0E0D
0E11 0E13
0E17 0E19
0E1E 0E27
0E2C
0E2E
END
}

sub InThaiDigit {
    return <<'END';
0E50 0E59
END
}

=head1 AUTHOR

Erik Mundall

=head1 COPYRIGHT

Copyright (C) 2015 Erik Mundall. All Rights Reserved.

This is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut

1;
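
For anyone following along who hasn't used user-defined properties: Perl looks up \p{InSomething} as a sub in the package where the regex is compiled, or in the package you name explicitly, which is why the subs have to be exported into the caller. Here is a minimal sketch of that lookup. The package Demo::ThaiProps and the sub InDemoThaiDigit are hypothetical stand-ins, not part of the module above, and the glob assignment only imitates what Exporter would do.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a property module; not the module posted above.
package Demo::ThaiProps;

# A user-defined property: the name starts with "In" and the sub returns
# one hex range per line (start and end separated by whitespace).
sub InDemoThaiDigit { return "0E50\t0E59\n" }    # THAI DIGIT ZERO .. NINE

package main;

my $thai_seven = "\x{0E57}";                     # THAI DIGIT SEVEN

# Without importing anything, the property must be package-qualified.
my $qualified = $thai_seven =~ /\p{Demo::ThaiProps::InDemoThaiDigit}/ ? 1 : 0;

# Imitate an Exporter-style import: alias the sub into the calling package
# at compile time so the unqualified name resolves.
BEGIN {
    no strict 'refs';
    *{"main::InDemoThaiDigit"} = \&Demo::ThaiProps::InDemoThaiDigit;
}

my $imported = $thai_seven =~ /\p{InDemoThaiDigit}/ ? 1 : 0;
my $ascii    = "7"         =~ /\p{InDemoThaiDigit}/ ? 1 : 0;

print "qualified name matches: $qualified\n";
print "imported name matches:  $imported\n";
print "ASCII digit matches:    $ascii\n";
```

The qualified form works with no import at all, so exporting is a convenience for shorter patterns rather than a requirement.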

Replies are listed 'Best First'.
Re^6: Namespace/advice for new CPAN modules for Thai & Lao ( Regexp::CharProps - User Defined Character Properties )
by Polyglot (Chaplain) on Mar 24, 2015 at 05:00 UTC
    "but he meant some unicode string"

    Yes. It definitely wouldn't work on an upper-ascii-type encoding such as Thai originally began with, without some form of encoding/decoding going on. I guess I put "UTF8" because that is what gets used most with Thai, and what I knew would work having developed strictly with that. I presume any Unicode type should work equally well, though I don't claim to be an expert on Unicode.

    In your code example:

    print "\$_ has got Thai" if m{
        \p{InThai}
        |\p{InThaiCons}
        |\p{InThaiHCons}
        |\p{InThaiMCons}
        |\p{InThaiLCons}
        |\p{InThaiVowel}
        |\p{InThaiPreVowel}
        |\p{InThaiPostVowel}
        |\p{InThaiCompVowel}
        |\p{InThaiDigit}
        |\p{InThaiTone}
        |\p{InThaiPunct}
    }x;

    ...only the first item in the OR'ed list should ever see action. All of the subsequent categories are already "InThai", and the "InThai" token already comes standard with Perl, AFAIK (see pg. 172 of "Programming Perl, Third Edition"), so that code would do little to test additional functionality. If the first line (\p{InThai}) failed, none of the others should succeed either.
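
The subset claim can be checked mechanically. The sketch below copies a few of the ranges from the module posted above into a hash (the hash layout is my own, purely for illustration) and tests every code point against Perl's built-in Thai block, written here as \p{Block=Thai}, the long form of the built-in \p{InThai}.

```perl
use strict;
use warnings;

# A few ranges taken from the posted module, as [start, end] pairs (inclusive).
# The hash itself is only an illustration, not part of the module.
my %ranges = (
    InThaiCons  => [ [ 0x0E01, 0x0E2E ] ],
    InThaiTone  => [ [ 0x0E48, 0x0E4B ] ],
    InThaiDigit => [ [ 0x0E50, 0x0E59 ] ],
    InThaiPunct => [
        [ 0x0E46, 0x0E46 ], [ 0x0E4C, 0x0E4C ],
        [ 0x0E4E, 0x0E4F ], [ 0x0E5A, 0x0E5B ],
    ],
);

# Collect any code point that falls outside the built-in Thai block.
my @outside;
for my $name ( sort keys %ranges ) {
    for my $range ( @{ $ranges{$name} } ) {
        for my $cp ( $range->[0] .. $range->[1] ) {
            push @outside, sprintf( '%s U+%04X', $name, $cp )
                unless chr($cp) =~ /\p{Block=Thai}/;
        }
    }
}

print @outside
    ? "Outside the Thai block: @outside\n"
    : "Every listed code point is inside the Thai block\n";
```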

    NOTE: I've updated my list to reflect your proposed name, but I've adapted it slightly to one that seems a better fit to me.

    Blessings,

    ~Polyglot~

      In your code example: ...only the first item in the OR'ed list should ever see action...

      SYNOPSIS only shows what's possible; it can be repetitive and incorrect as long as the syntax is valid. And when the exports are few, you might as well show them all instead of "..."

      Regexp::CharProps

      My suggestion was that you call yours Regexp::CharProps::Thai not Regexp::CharProps. Also to distribute a helper parent module Regexp::CharProps with it, so that others can add Regexp::CharProps::AnonyRands or whatever ... a new well named place for these definitions to live

      Regexp::Thai::CharClasses

      So are you going to have more Thai regexps that aren't CharClasses?

      I think you got that backwards, it should be Regexp::CharClasses::Thai :)

      Or it should go into Lingua::Thai::RegexpCharClasses? In case you're going to have more Lingua::Thai things that aren't RegexpCharClasses

      Yes. It definitely wouldn't work on an upper-ascii-type encoding such as Thai originally began with, without some form of encoding/decoding going on. I guess I put "UTF8" because that is what gets used most with Thai, and what I knew would work having developed strictly with that. I presume any Unicode type should work equally well, though I don't claim to be an expert on Unicode.

      Right :) the numbers are Unicode code points, independent of encoding.
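
A small sketch of that point, assuming the input arrives as bytes in either UTF-8 or the legacy single-byte Thai encoding (ISO-8859-11, which shares the TIS-620 layout). Once decoded with the core Encode module, both yield the same code point, and that code point is all a property match ever sees. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same letter, THAI CHARACTER KO KAI (U+0E01), as raw bytes in two encodings.
my $utf8_bytes    = "\xE0\xB8\x81";    # UTF-8: three bytes
my $iso8859_bytes = "\xA1";            # ISO-8859-11: one byte

# Decoding turns bytes into characters (code points).
my $from_utf8    = decode( 'UTF-8',       $utf8_bytes );
my $from_iso8859 = decode( 'iso-8859-11', $iso8859_bytes );

printf "UTF-8 decodes to       U+%04X\n", ord $from_utf8;
printf "ISO-8859-11 decodes to U+%04X\n", ord $from_iso8859;

# Both decoded strings match the Thai block; the undecoded legacy byte
# does not, because Perl sees it as U+00A1.
my $decoded_is_thai =
    ( $from_utf8 =~ /\p{Block=Thai}/ && $from_iso8859 =~ /\p{Block=Thai}/ ) ? 1 : 0;
my $raw_is_thai = $iso8859_bytes =~ /\p{Block=Thai}/ ? 1 : 0;

print "decoded strings match Thai: $decoded_is_thai\n";
print "raw legacy byte matches Thai: $raw_is_thai\n";
```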

        SYNOPSIS only shows what's possible; it can be repetitive and incorrect as long as the syntax is valid. And when the exports are few, you might as well show them all instead of "..."

        Thank you for the clarification. I guess I misunderstood the intent of that. As is obvious, I've never submitted a module before, so I appreciate your patience with me.

        My suggestion was that you call yours Regexp::CharProps::Thai not Regexp::CharProps.

        Ok, I fixed that.

        Also to distribute a helper parent module Regexp::CharProps with it, so that others can add Regexp::CharProps::AnonyRands or whatever ... a new well named place for these definitions to live

        I have no idea how to do this.
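
One possible shape for such a helper, sketched with only the core Exporter module instead of Import::Into. Everything named Demo::* here is hypothetical, and both packages sit in one file only so the sketch runs on its own. In a real distribution each would be its own .pm file and the parent would `require` the child before exporting from it.

```perl
use strict;
use warnings;

# Hypothetical child module holding the actual property subs.
package Demo::CharProps::Digits;
use Exporter ();
BEGIN {
    # Set up at compile time because this sketch imports from the same file.
    our @ISA         = ('Exporter');
    our @EXPORT_OK   = qw( InDemoThaiDigit );
    our %EXPORT_TAGS = ( all => [@EXPORT_OK] );
}
sub InDemoThaiDigit { return "0E50\t0E59\n" }    # THAI DIGIT ZERO .. NINE

# Hypothetical parent module: "use Demo::CharProps qw/ Digits /" would
# forward to each named child and export its :all tag into the caller.
package Demo::CharProps;

sub import {
    my ( $class, @children ) = @_;
    for my $child (@children) {
        my $package = $class . '::' . $child;
        # Level 1 is the package that called this import().
        $package->export_to_level( 1, $package, ':all' );
    }
}

package main;

# Stand-in for "use Demo::CharProps qw/ Digits /;" in a single-file demo.
BEGIN { Demo::CharProps->import('Digits') }

my $matched = "\x{0E55}" =~ /\p{InDemoThaiDigit}/ ? 1 : 0;    # THAI DIGIT FIVE
print "imported through the parent: $matched\n";
```

The parent contains no Thai-specific code, so someone else's child module with its own property subs would plug in the same way.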

        So are you going to have more Thai regexps that aren't CharClasses?
        I think you got that backwards, it should be Regexp::CharClasses::Thai :)

        Looking at that module now, perhaps it could all just go into Regexp::CharClasses, but I'm not the developer for that, and when I looked at its code, it's done in a somewhat different style which is confusing to me. I don't see any logical difference between Regexp::CharClasses::Thai and Regexp::Thai::CharClasses, except that, to my understanding, the former would be inhibited by the fact another developer has already used the Regexp::CharClasses namespace. Am I missing something here?

        Blessings,

        ~Polyglot~