Re: Listing out the characters included in a character class
by kcott (Archbishop) on Oct 27, 2023 at 11:40 UTC
|
G'day Polyglot,
"I think they're called character classes, but please correct me if I've misunderstood"
Both perlrecharclass
and perlrebackslash
have information on character classes.
Your InThaiHCons() and InThaiLCons() seem overcomplicated.
See my example module below.
"The site would not print the UTF-8 characters"
Use <pre> tags for blocks containing characters outside the 7-bit ASCII range.
I also use <tt> tags for individual, inline items.
I'm providing you with some sample code which hopefully will get you started.
ken@titan ~/tmp/pm_11155205_uni_char_class
$ ls -l
total 2
-rw-r--r-- 1 ken None 590 Oct 27 22:12 PolyUniCharClass.pm
-rwxr-xr-x 1 ken None 404 Oct 27 22:13 uni_char_class.pl
Here's the module code:
package PolyUniCharClass;
use strict;
use warnings;
{
    my %char_class_despatch = (
        InThaiHCons => \&InThaiHCons,
        InThaiLCons => \&InThaiLCons,
    );
    sub list {
        my ($char_class) = @_;
        unless (exists $char_class_despatch{$char_class}) {
            warn "Char class '$char_class' doesn't exist!\n";
            return [];
        }
        return [
            map chr hex, @{$char_class_despatch{$char_class}->()}
        ];
    }
}
sub InThaiHCons {
    return [qw{0E02 0E03 0E09}];
}
sub InThaiLCons {
    return [qw{0E04 0E07 0E0A}];
}
1;
Here's the script:
#!/usr/bin/env perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};
use lib '.'; # DEMO ONLY -- DON'T use in PRODUCTION!
use PolyUniCharClass;
print "InThaiHCons:\n";
print @{PolyUniCharClass::list('InThaiHCons')}, "\n";
print "InThaiLCons:\n";
print @{PolyUniCharClass::list('InThaiLCons')}, "\n";
print "InThaiMCons:\n";
print @{PolyUniCharClass::list('InThaiMCons')}, "\n";
Here's a run (note <pre> tags):
ken@titan ~/tmp/pm_11155205_uni_char_class
$ ./uni_char_class.pl
InThaiHCons:
ขฃฉ
InThaiLCons:
คงช
InThaiMCons:
Char class 'InThaiMCons' doesn't exist!
| [reply] [d/l] [select] |
This was the best response yet...and the voters seem to agree. Thank you.
"Your InThaiHCons() and InThaiLCons() seem overcomplicated."
There are two nuances to this which you may not have grasped: 1) The double-column codepoints in the 'InThaiLCons' indicate ranges, i.e. the '0E04 0E07' line will actually return '0E04 0E05 0E06 0E07'; and 2) I have formatted the 'InThaiHCons' as I have in order to be able to indicate in the markup what the codepoints represent. It's hard to look at a codepoint and just remember which character it is for, and as the code maintainer, this association helps me tremendously, especially for certain characters. However, I am considering removing those comments for the sake of code brevity and tidiness before releasing the module to CPAN, which I fully intend to do soon, having delayed years already in doing so due to my own lack of confidence (this will be a first for me).
That said, in my quest for a way to do this, I discovered that the subroutines can be called directly, outside of a regular expression, and they then return the codepoints I want. However, the result does not preserve the two-column ranges shown in the 'InThaiLCons' of my example; the codepoints simply come back as one flat list. So I have decided not to use ranges, despite their obvious efficiency, and to list every single codepoint instead. This solves a couple of problems at once, at the only cost of longer lists (i.e. more code). So, my new 'InThaiLCons' would look like this:
sub InThaiLCons {
    return join "\n",
        '0E04', '0E05', '0E06', '0E07', '0E0A', '0E0B', '0E0C', '0E0D',
        '0E11', '0E12', '0E13', '0E17', '0E18', '0E19', '0E1E', '0E1F',
        '0E20', '0E21', '0E22', '0E23', '0E24', '0E25', '0E26', '0E27',
        '0E2C', '0E2E';
}
However, after your suggestions, that can be more efficiently represented as:
sub InThaiLCons {
return [qw{
0E04 0E05 0E06 0E07 0E0A 0E0B 0E0C 0E0D
0E11 0E12 0E13 0E17 0E18 0E19 0E1E 0E1F
0E20 0E21 0E22 0E23 0E24 0E25 0E26 0E27
0E2C 0E2E }]
}
I have a new problem, in that I want to use two names for each of these subroutines: i.e. 'InThai...' and 'IsThai...'. Essentially, they appear to be synonymous for many current usages, and I wish for either of these forms to be acceptable with this new functionality as well. So, must I repeat the entire subroutine in the code, changing only its name? or is there a way to alias it to another name?
Regarding the use of <pre> tags, are they equivalent to the <code> tags? I had put the UTF8 characters in a <code> block, and they got converted to ugly HTML-entities. That's why I moved them to outside of that block.
Incidentally, there will indeed also be an 'InThaiMCons' definition in this module (and more)!
| [reply] [d/l] [select] |
#!/usr/bin/env perl
use strict;
use warnings;
sub hello { print "Hello $_[0]!\n"; }
*hi = \&hello;
sub bonjour { &hello }
hello ('there');
hi ('world');
bonjour ('Alain');
It's usually cleaner and clearer just to have the one name for any given piece of functionality, however.
| [reply] [d/l] |
You can still keep ranges. There are better ways to represent them; see code below.
You can represent Unicode names against individual codepoints;
it will become somewhat difficult and possibly messy for ranges of codepoints.
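For individual codepoints, one way to avoid maintaining name comments by hand is to look the names up at run time. This is only a minimal sketch: describe_codepoints() is a hypothetical helper, not part of the modules in this thread, and the // operator assumes Perl 5.10 or later. charnames::viacode() is the core function that Tux's one-liner further down also uses.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ();    # load charnames::viacode() without importing anything
use open OUT => qw{:encoding(UTF-8) :std};

# Hypothetical helper: takes hex codepoint strings and returns one
# "U+XXXX <glyph> <Unicode name>" line per codepoint.
sub describe_codepoints {
    my @hex_codepoints = @_;
    return map {
        my $cp = hex $_;
        sprintf 'U+%04X %s %s', $cp, chr($cp),
            charnames::viacode($cp) // '<no name>';
    } @hex_codepoints;
}

print "$_\n" for describe_codepoints(qw{0E02 0E03 0E09});
```

The same helper could be pointed at any of the codepoint lists in the module while developing, so the source itself does not have to carry the glyphs.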
I recommend that you have Unicode PDF Character Code Chart
"Thai -- Range: 0E00-0E7F"
at hand when developing; this sequentially lists the codepoints, their glyphs, their names,
and some entries have additional notes.
You might consider adding that link to your module's POD.
If you're writing code for other (Unicode) scripts,
you can find links to all of the current charts at
"Unicode 15.1 Character Code Charts".
Having multiple names for the same subroutine is often confusing and generally, in my opinion, a design flaw;
however, it's easily achieved with additional keys in the despatch table.
I would urge you to reconsider if that's something you really need.
Update:
I've just posted and saw your reply to hippo.
Given your explanation, use of multiple names seems valid in this instance.
New script and Module:
ken@titan ~/tmp/pm_11155205_uni_char_class
$ ls -l *2*
-rw-r--r-- 1 ken None 1275 Oct 29 01:50 PolyUniCharClass2.pm
-rwxr-xr-x 1 ken None 370 Oct 29 01:42 uni_char_class_2.pl
uni_char_class_2.pl:
#!/usr/bin/env perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};
use lib '.'; # DEMO ONLY -- DON'T use in PRODUCTION!
use PolyUniCharClass2;
for my $prefix (qw{In Is If}) {
for my $class (qw{H L M}) {
my $cons = "${prefix}Thai${class}Cons";
print "$cons:\n";
print @{PolyUniCharClass2::list($cons)}, "\n";
}
}
PolyUniCharClass2.pm:
package PolyUniCharClass2;
use strict;
use warnings;
{
    my %char_class_despatch = (
        InThaiHCons => \&InThaiHCons,
        InThaiLCons => \&InThaiLCons,
        IsThaiHCons => \&InThaiHCons,
        IsThaiLCons => \&InThaiLCons,
    );
    sub list {
        my ($char_class) = @_;
        unless (exists $char_class_despatch{$char_class}) {
            warn "Char class '$char_class' doesn't exist!\n";
            return [];
        }
        return [map chr, @{$char_class_despatch{$char_class}->()}];
    }
}
{
    my $ThaiHCons = [qw{0E02-0E03 0E09 0E10 0E16}];
    my $ThaiLCons = [qw{0E04-0E07 0E0A-0E0D 0E11}];
    my %ThaiCons_expanded;
    sub InThaiHCons {
        return $ThaiCons_expanded{InThaiHCons} ||= _expand($ThaiHCons);
    }
    sub InThaiLCons {
        return $ThaiCons_expanded{InThaiLCons} ||= _expand($ThaiLCons);
    }
}
{
    my $re = qr{^([0-9A-Fa-f]+)-([0-9A-Fa-f]+)$};
    sub _expand {
        my ($code_range_list) = @_;
        my @full_list;
        for my $range (@$code_range_list) {
            if ($range =~ $re) {
                push @full_list, hex($1) .. hex($2);
            }
            else {
                push @full_list, hex $range;
            }
        }
        return [@full_list];
    }
}
1;
Output:
$ ./uni_char_class_2.pl
InThaiHCons:
ขฃฉฐถ
InThaiLCons:
คฅฆงชซฌญฑ
InThaiMCons:
Char class 'InThaiMCons' doesn't exist!
IsThaiHCons:
ขฃฉฐถ
IsThaiLCons:
คฅฆงชซฌญฑ
IsThaiMCons:
Char class 'IsThaiMCons' doesn't exist!
IfThaiHCons:
Char class 'IfThaiHCons' doesn't exist!
IfThaiLCons:
Char class 'IfThaiLCons' doesn't exist!
IfThaiMCons:
Char class 'IfThaiMCons' doesn't exist!
There are a number of improvements you could make to the module code depending on the Perl version you're targeting.
You didn't indicate your Perl version.
The code I've presented should, I believe, work fine with Perl 5.6 (but I have no way to check that).
| [reply] [d/l] [select] |
In my last response, I believe I covered all of the coding issues.
I finished with:
"There are a number of improvements you could make to the module code depending on the Perl version you're targeting. ... The code I've presented should, I believe, work fine with Perl 5.6 ..."
Perl does a great job of keeping up with Unicode versions.
The latest Unicode version is 15.1; Perl v5.38 (the latest stable version) supports Unicode 15.0
(see "perl5380delta: Unicode 15.0 is supported").
Writing your code for Perl 5.6 may be insufficient to handle the Unicode support you need;
look through the deltas to find the minimum Perl version for your needs.
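As a quick check alongside reading the deltas, you can ask an installed perl which Unicode version it was built with. A minimal sketch using the core Unicode::UCD module; the exact wording of the printed line is my own:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD ();

# UnicodeVersion() reports the Unicode version this perl's data files implement.
printf "Perl %vd supports Unicode %s\n", $^V, Unicode::UCD::UnicodeVersion();
```

Running this with each candidate perl shows directly whether the Thai codepoints you need are covered by that installation.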
Partly because it was a fun task for me, but also to show you some of the improvements you could get from a later version,
here's the code rewritten for Perl v5.38 and Unicode 15.0.
New script and module:
ken@titan ~/tmp/pm_11155205_uni_char_class
$ ls -l *3*
-rw-r--r-- 1 ken None 993 Oct 29 05:03 PolyUniCharClass3.pm
-rwxr-xr-x 1 ken None 344 Oct 29 05:03 uni_char_class_3.pl
uni_char_class_3.pl:
#!/usr/bin/env perl
use v5.38;
use open OUT => qw{:encoding(UTF-8) :std};
use lib '.'; # DEMO ONLY -- DON'T use in PRODUCTION!
use PolyUniCharClass3;
for my $prefix (qw{In Is If}) {
for my $class (qw{H L M}) {
my $cons = "${prefix}Thai${class}Cons";
say "$cons:";
say PolyUniCharClass3::list($cons)->@*;
}
}
PolyUniCharClass3.pm:
package PolyUniCharClass3;
use v5.38;
sub list ($char_class) {
    state $valid_char_class = {map +($_ => 1), qw{
        InThaiHCons IsThaiHCons InThaiLCons IsThaiLCons
    }};
    unless (exists $valid_char_class->{$char_class}) {
        warn "Char class '$char_class' doesn't exist!\n";
        return [];
    }
    return [map chr, ThaiCons(substr $char_class, 2)->@*];
}
sub ThaiCons ($cons) {
    state $code_ranges = {
        ThaiHCons => [qw{0E02-0E03 0E09 0E10 0E16}],
        ThaiLCons => [qw{0E04-0E07 0E0A-0E0D 0E11}],
    };
    state $ThaiCons_expanded;
    return $ThaiCons_expanded->{$cons} //= _expand($code_ranges->{$cons});
}
sub _expand ($code_range_list) {
    state $re = qr{^([0-9A-Fa-f]+)-([0-9A-Fa-f]+)$};
    my @full_list;
    for my $range ($code_range_list->@*) {
        if ($range =~ $re) {
            push @full_list, hex($1) .. hex($2);
        }
        else {
            push @full_list, hex $range;
        }
    }
    return [@full_list];
}
Output (unchanged):
ken@titan ~/tmp/pm_11155205_uni_char_class
$ ./uni_char_class_3.pl
InThaiHCons:
ขฃฉฐถ
InThaiLCons:
คฅฆงชซฌญฑ
InThaiMCons:
Char class 'InThaiMCons' doesn't exist!
IsThaiHCons:
ขฃฉฐถ
IsThaiLCons:
คฅฆงชซฌญฑ
IsThaiMCons:
Char class 'IsThaiMCons' doesn't exist!
IfThaiHCons:
Char class 'IfThaiHCons' doesn't exist!
IfThaiLCons:
Char class 'IfThaiLCons' doesn't exist!
IfThaiMCons:
Char class 'IfThaiMCons' doesn't exist!
There were a couple of points at the end of your post which I didn't address. Here goes:
"Regarding the use of <pre> tags, are they equivalent to the <code> tags?"
They sort of do the same job but have these differences:
-
Unicode characters will be rendered instead of the entity references you get with <code> tags.
After previewing, you may see entity references in the textarea where you're typing,
but the preview itself should show the characters (assuming you have appropriate fonts to display them).
-
You won't get a [download] link at the end of the block.
-
You won't get code wrapping: <code> tags will break long lines,
starting wrapped lines with a prominent + (by default, it's red).
Because of this, aim to keep lines short.
-
You'll need to handle special characters yourself; e.g. writing &#91; instead of [.
See "site how to: Submitting Code and Escaping Characters" for more about that.
"Incidentally, there will indeed also be an 'InThaiMCons' definition in this module (and more)!"
I picked the names like If* and *M* for my testing.
Your test suite (t/*.t scripts) should check that both success and failure are handled appropriately.
| [reply] [d/l] [select] |
By the way, I discovered that the shortened form of the subroutine did not work, and have reverted to my previous format for returning the codepoints.
This is the sort of error I got from it...
Can't find Unicode property definition "ARRAY(0x55ff865ace80)" in expansion of main::IsThai
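The error above suggests that \p{...} received the stringified array reference rather than a codepoint string. One way to keep a single definition and still list the characters is to leave the property sub returning the string format and parse that string when a list is wanted. What follows is a sketch under assumptions: IsThaiLConsDemo holds only a subset of the low-class consonants (the same ranges as kcott's second module), chars_of_property() is a hypothetical helper, it understands only plain "XXXX" and "XXXX YYYY" lines (not the +, -, ! or & prefixed lines that perlunicode also allows), and // needs Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};

# Property sub in the newline-separated string format that \p{...} expects.
# Two hex numbers separated by whitespace on one line denote a range.
sub IsThaiLConsDemo {
    return join "\n", "0E04\t0E07", "0E0A\t0E0D", '0E11', '';
}

# Hypothetical helper: call a property sub and expand its string
# into a flat list of characters, including the ranges.
sub chars_of_property {
    my ($property_sub) = @_;
    my @chars;
    for my $line (split /\n/, $property_sub->()) {
        my ($from, $to) =
            $line =~ /^\s*([0-9A-Fa-f]+)(?:\s+([0-9A-Fa-f]+))?\s*$/
            or next;    # skip anything this sketch does not understand
        push @chars, map chr, hex($from) .. hex($to // $from);
    }
    return @chars;
}

# The same sub serves both purposes: listing and matching.
print chars_of_property(\&IsThaiLConsDemo), "\n";
print "matched\n" if "\x{0E05}" =~ /\p{IsThaiLConsDemo}/;
```

This keeps the ranges in the source while still giving a straight list on request, which was the reason given earlier for abandoning ranges.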
| [reply] [d/l] |
Re: Listing out the characters included in a character class
by soonix (Chancellor) on Oct 27, 2023 at 08:14 UTC
|
Perhaps unichars from Unicode::Tussle ("Tom's Unicode Scripts So Life is Easier") is what you want (or helps you come near that)?
Re: Listing out the characters included in a character class
by Tux (Canon) on Oct 30, 2023 at 10:44 UTC
|
$ perl -COE -wE'use charnames (); use unicore::Name; my $script = shift; open my $fh, "<", $INC{"unicore/Name.pm"} =~ s/Name\.pm$/Blocks.txt/r; for (grep m/\b$script\b/ => <$fh>) { m/\b([0-9A-Z]+)\b\s*\.\.\s*([0-9A-Z]+)/ or next; foreach my $c (hex ($1) .. hex ($2)) { my $nm = charnames::viacode ($c) or next;printf "U+%05x %-5s %s\n", $c, chr ($c), $nm}}' Thai
U+00e01 ก THAI CHARACTER KO KAI
U+00e02 ข THAI CHARACTER KHO KHAI
U+00e03 ฃ THAI CHARACTER KHO KHUAT
U+00e04 ค THAI CHARACTER KHO KHWAI
U+00e05 ฅ THAI CHARACTER KHO KHON
U+00e06 ฆ THAI CHARACTER KHO RAKHANG
:
:
U+00e58 ๘ THAI DIGIT EIGHT
U+00e59 ๙ THAI DIGIT NINE
U+00e5a ๚ THAI CHARACTER ANGKHANKHU
U+00e5b ๛ THAI CHARACTER KHOMUT
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] |
Re: Listing out the characters included in a character class
by Anonymous Monk on Oct 27, 2023 at 07:57 UTC
|
| [reply] |
Re: Listing out the characters included in a character class
by NERDVANA (Priest) on Oct 31, 2023 at 10:27 UTC
|
use Mock::Data 'charset';
my $thaiLCons= charset(
codepoint_ranges => [
0x0E04, 0x0E07,
0x0E0A, 0x0E0D,
0x0E11, 0x0E13,
0x0E17, 0x0E19,
0x0E1E, 0x0E27,
],
codepoints => [ 0x0E2C, 0x0E2E ],
);
my @characters= $thaiLCons->members->@*;
Or in a one-liner!
perl -E 'use Mock::Data "charset"; my $c= charset("\x{E2C}\x{E2E}\x{E04}-\x{E07}\x{E0A}-\x{E0D}\x{E11}-\x{E13}\x{E17}-\x{E19}\x{E1E}-\x{E27}"); use DDP; &p($c->members);'
See the rest of the examples for Mock::Data::Charset
In short, the inversion lists from Unicode::UCD are the *design* you want, but that module doesn't give you utilities to work *with* inversion lists; it just gives you the lists from the Unicode standard. You need to write a bunch of code like I did in Mock::Data::Charset.
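For readers who have not met inversion lists, here is a minimal sketch of expanding one by hand. prop_invlist() is a real Unicode::UCD function, but as far as I know it only covers standard properties, not user-defined ones such as InThaiLCons; invlist_to_chars() is a hypothetical helper written for this example, and 'Script=Thai' is used only as a convenient standard property.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD qw{prop_invlist};
use open OUT => qw{:encoding(UTF-8) :std};

# Hypothetical helper: expand an inversion list into characters.
# Even-indexed entries start a range that IS in the set;
# odd-indexed entries start the following range that is NOT in it.
sub invlist_to_chars {
    my @invlist = @_;
    my @chars;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $first = $invlist[$i];
        # A missing final entry means the range runs to the last codepoint.
        my $last = ($i + 1 < @invlist ? $invlist[$i + 1] : 0x110000) - 1;
        push @chars, map chr, $first .. $last;
    }
    return @chars;
}

my @thai = invlist_to_chars(prop_invlist('Script=Thai'));
print scalar(@thai), " characters: @thai\n";
```

Mock::Data::Charset wraps this kind of bookkeeping (plus set operations) so you do not have to write it yourself.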
Re: Listing out the characters included in a character class
by Anonymous Monk on Oct 27, 2023 at 08:20 UTC
|
| [reply] |
Re: Listing out the characters included in a character class
by Polyglot (Chaplain) on Nov 01, 2023 at 07:12 UTC
|
Presently, after much work, and after cutting out much of the code to make it shorter (per Hippo's sage advice), the script runs, but still logs some errors. I've narrowed the problem down, but am still mystified as to the reason for it. I've even done a full server restart, and yet the errors persist. Here is the script's output to the browser (using "pre" tags for proper UTF8 rendering):
Content-Language: utf8;
Checking the Thai module
ok 1 - use RegexpCharClassesThai;
Positives...ok 2 - Match for "ก" =~ /\p{IsKokai}/
not ok 3 - Match for "ก" =~ /\p{InThaiMCons}/
Negatives...ok 4 - No match for "ก" =~ /\p{InThaiHCons}/
ok 5 - No match for "ก" =~ /\p{InThaiLCons}/
ok 6 - No match for "ก" =~ /\p{InThaiVowel}/
ok 7 - No match for "ก" =~ /\p{InThaiPreVowel}/
Positives...not ok 8 - Match for "ไ" =~ /\p{InThaiVowel}/
ok 9 - Match for "ไ" =~ /\p{InThaiPreVowel}/
Negatives...ok 10 - No match for "ไ" =~ /\p{InThaiHCons}/
ok 11 - No match for "ไ" =~ /\p{InThaiMCons}/
ok 12 - No match for "ไ" =~ /\p{InThaiLCons}/
Check:
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
INC: /var/www/lib/ /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.34.0 /usr/local/share/perl/5.34.0 /usr/lib/x86_64-linux-gnu/perl5/5.34 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.34 /usr/share/perl/5.34 /usr/local/lib/site_perl /etc/apache2
Here is the output to the log file:
[Wed Nov 01 06:37:06.123037 2023] [core:error] [pid 754:tid 139702641829440] [client 192.168.1.101:58127] Premature end of script headers: test-thai-mod.pl
[Wed Nov 01 06:37:06.123072 2023] [perl:warn] [pid 754:tid 139702641829440] /cgi/test-thai-mod.pl did not send an HTTP header
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ก" =~ /\p{InThaiMCons}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 32.
# got: ''
# expected: '1'
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ไ" =~ /\p{InThaiVowel}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 43.
# got: ''
# expected: '1'
[Wed Nov 01 06:38:29.720987 2023] [core:error] [pid 753:tid 139702641829440] [client 192.168.1.101:58138] Premature end of script headers: test-thai-mod.pl
[Wed Nov 01 06:38:29.721022 2023] [perl:warn] [pid 753:tid 139702641829440] /cgi/test-thai-mod.pl did not send an HTTP header
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ก" =~ /\p{InThaiMCons}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 32.
# got: ''
# expected: '1'
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ไ" =~ /\p{InThaiVowel}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 43.
# got: ''
# expected: '1'
And here is the MODULE code:
package RegexpCharClassesThai;
use 5.008003;
use strict;
use warnings;
use utf8;
use Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = (
classes =>
[ qw(InThaiHCons InThaiMCons InThaiLCons
InThaiVowel InThaiPreVowel
IsThaiHCons IsThaiMCons IsThaiLCons
IsThaiVowel IsThaiPreVowel) ],
characters =>
[ qw(InKokai InKhokhai
IsKokai IsKhokhai ) ],
);
# add all the other ":class" tags to the ":all" class,
# deleting duplicates
{
my %seen;
push @{$EXPORT_TAGS{all}},
grep {!$seen{$_}++} @{$EXPORT_TAGS{$_}} foreach keys %EXPORT_TAGS;
}
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = ( @{ $EXPORT_TAGS{'classes'} } );
our $VERSION = '1.00';
#--------------------------------------------------------------
# CREATE FUNCTIONALITY FOR SHOWING CONTENTS OF EACH CLASS
#--------------------------------------------------------------
my %char_class_dispatch = (
InThaiHCons => \&InThaiHCons,
InThaiMCons => \&InThaiMCons,
InThaiLCons => \&InThaiLCons,
InThaiVowel => \&InThaiVowel,
InThaiPreVowel => \&InThaiPreVowel,
);
sub list {
my ($char_class) = @_;
unless (exists $char_class_dispatch{$char_class}) {
warn "Char class '$char_class' doesn't exist!\n";
return [];
}
return [
map chr hex, @{$char_class_dispatch{$char_class}->()}
];
}
#--------------------------------------------------------------
# Start with the "Is..." versions
#--------------------------------------------------------------
sub IsThaiHCons { #THAI HIGH-CLASS CONSONANTS
# ข ฃ ฉ ฐ ถ ผ ฝ ศ ษ ส ห
return qw{
0E02 0E03 0E09 0E10 0E16 0E1C 0E1D 0E28
0E29 0E2A 0E2B
}
}
sub IsThaiMCons { #THAI MID-CLASS CONSONANTS
# ก จ ฎ ฏ ด ต บ ป อ
return qw{
0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B
0E2D
}
}
sub IsThaiLCons { #THAI LOW-CLASS CONSONANTS
# ค ฅ ฆ ง ช ซ ฌ ญ ฑ ฒ ณ ท ธ น พ ฟ ภ ม ย ร ฤ ล ฦ ว ฬ ฮ
return qw{
0E04 0E05 0E06 0E07 0E0A 0E0B 0E0C 0E0D
0E11 0E12 0E13 0E17 0E18 0E19 0E1E 0E1F
0E20 0E21 0E22 0E23 0E24 0E25 0E26 0E27
0E2C 0E2E
}
}
sub IsThaiVowel { #THAI VOWELS
#NOTE: 0E4D combines with a consonant but may not be considered a vowel
# ย ฤ ฦ ว อ ะ ั า ํา ิ ี ึ ื ุ ู ฺ เ แ โ ใ ไ ๅ ็ ํ
return qw{
0E22 0E24 0E26 0E27 0E2D 0E30 0E31 0E32
0E33 0E34 0E35 0E36 0E37 0E38 0E39 0E3A
0E40 0E41 0E42 0E43 0E44 0E45 0E47 0E4D
}
}
sub IsThaiPreVowel { #VOWELS PRECEDING CONSONANT
# เ แ โ ใ ไ
return qw{
0E40 0E41 0E42 0E43 0E44
}
}
#--------------------------------------------------------------
# Alias the "In..." forms (same as above)
#--------------------------------------------------------------
sub InThaiHCons { &IsThaiHCons };
sub InThaiMCons { &IsThaiMCons };
sub InThaiLCons { &IsThaiLCons };
sub InThaiVowel { &IsThaiVowel };
sub InThaiPreVowel { &IsThaiPreVowel };
#--------------------------------------------------------------
# Provide spelled-out forms of the individual characters
#--------------------------------------------------------------
sub IsKokai { return '0E01' } # ก - THAI CHARACTER KO KAI
sub IsKhokhai { return '0E02' } # ข - THAI CHARACTER KHO KHAI
#--------------------------------------------------------------
# Alias the spelled-out individual characters
#--------------------------------------------------------------
sub InKokai { &IsKokai }
sub InKhokhai { &IsKhokhai }
1;
__END__
=pod
=encoding utf8
=head1 DESCRIPTION
This module supplements the UTF-8 character-class definitions
available to regular expressions (regex) with special groups
relevant to Thai linguistics. The following classes are defined:
โมดูลนี้เป็นส่วนเสริมคำจำกัดความคลาสอักขระ UTF-8
ใช้ได้กับนิพจน์ทั่วไป (regex) ด้วยกลุ่มพิเศษ
ที่เกี่ยวข้องกับภาษาศาสตร์ไทย มีการกำหนดคลาสต่อไปนี้:
=over 4
=item InThaiVowel / IsThaiVowel
Matches Thai vowels only, including compounded and free-standing vowels.
Exceptions here include several of the "consonants" which also serve as
vowels: or-ang, yo-yak, double ro-reua, leut and reut, and wo-wen.
NOTE: Thai vowels cannot stand alone: they are always connected with a
consonant. Many of these, without their consonant companions, will appear
with the unicode dotted-circle character (U+25CC) when rendered, showing
a character is missing. Conversely, Thai consonants can exist without a
vowel, and some Thai words do not have written vowels (the vowel is implied).
=item InThaiPreVowel / IsThaiPreVowel
Matches only the subset of vowels which appear _before_ the consonant
with which they are associated (though in Thai they are sounded _after_
said consonant); this excludes all consonant-vowels and does not include
any of the compounded vowels.
=item InThaiHCons / IsThaiHCons
Matches Thai high-class consonants.
=item InThaiMCons / IsThaiMCons
Matches Thai middle-class consonants.
=item InThaiLCons / IsThaiLCons
Matches Thai low-class consonants.
=back
=cut
And here is the SCRIPT code:
#!/usr/bin/perl
#TEST THAI MODULE
use strict;
use warnings;
use lib '/var/www/lib/';
use RegexpCharClassesThai;
use RegexpCharClassesThai qw( :all );
use utf8;
use Test::More;
binmode STDERR, ":utf8";
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
BEGIN {
print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";
}
print <<PAGE;
<html lang="utf8">
<body>
<h3>Checking the Thai module</h3>
PAGE
use_ok('RegexpCharClassesThai');
print "<h5>Positives...</h5>";
is( 'ก' =~ /[\p{IsKokai}]/,1,' Match for "ก" =~ /\p{IsKokai}/<br>');
is( 'ก' =~ /\p{InThaiMCons}/,1,' Match for "ก" =~ /\p{InThaiMCons}/<br>');
#PRODUCES ERROR, STOPPING CODE EXECUTION
#is( 'ก' =~ /\p{InThaiNonexistent}/,1,' Match for "ก" =~ /\p{InThaiFinCons}/<br>');
print "<h5>Negatives...</h5>";
isnt( 'ก' =~ /\p{InThaiHCons}/,1,' No match for "ก" =~ /\p{InThaiHCons}/<br>');
isnt( 'ก' =~ /\p{InThaiLCons}/,1,' No match for "ก" =~ /\p{InThaiLCons}/<br>');
isnt( 'ก' =~ /\p{InThaiVowel}/,1,' No match for "ก" =~ /\p{InThaiVowel}/<br>');
isnt( 'ก' =~ /\p{InThaiPreVowel}/,1,' No match for "ก" =~ /\p{InThaiPreVowel}/<br>');
print "<h5>Positives...</h5>";
is( 'ไ' =~ /\p{InThaiVowel}/,1,' Match for "ไ" =~ /\p{InThaiVowel}/<br>');
is( 'ไ' =~ /\p{InThaiPreVowel}/,1,' Match for "ไ" =~ /\p{InThaiPreVowel}/<br>');
print "<h5>Negatives...</h5>";
isnt( 'ไ' =~ /\p{InThaiHCons}/,1,' No match for "ไ" =~ /\p{InThaiHCons}/<br>');
isnt( 'ไ' =~ /\p{InThaiMCons}/,1,' No match for "ไ" =~ /\p{InThaiMCons}/<br>');
isnt( 'ไ' =~ /\p{InThaiLCons}/,1,' No match for "ไ" =~ /\p{InThaiLCons}/<br>');
print <<PAGE;
<h3>Check:</h3>
<p>PATH: $ENV{PATH}</p>
<p>INC: @INC</p>
</body>
</html>
PAGE
As the script output to the browser indicates, there is a problem with the "Positives": one works, the other does not in both the consonant and the vowel cases. If the module were not properly read ("used"), the errors would stop code execution. But the module is being read, and, to my eye, the subroutines of the working and non-working rules both follow the same style. I have no idea what more could be done to fix the ones that are not working.
All this just goes to show how truly "gifted" I am...the system is always "gifting" me with problems that no one else seems to be privileged to experience! (Now, perhaps some eagle-eyed coder will embarrass me by pointing out the most obvious of flaws...ha! And yet I should be most glad of it!)
Note that in this post everything is copy/pasted from the original (already trimmed) sources, with the only alterations being those required to format it for proper display here. In other words, if these scripts run on your server, then my server may have some issues. If, however, the problem is in the code itself, your server should reflect the same issues I'm seeing. (Encoding of the UTF8 characters may be an issue in proper transfer, however, as this site converts them to HTML-entities--why can't perlmonks.org be more up-to-date with encodings? /gripe.)
| [reply] [d/l] |
Your subs are expected to return a string.
For example, the second test passes with the following definition for IsThaiMCons:
sub IsThaiMCons {
return <<'.';
0E01
0E08
0E0E 0E0F
0E14 0E15
0E1A 0E1B
0E2D
.
}
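A plausible explanation for the exact pattern of failures, assuming the property sub is called in scalar context: a sub that ends in "return qw{...}" then hands back only its last element. That would make IsThaiMCons match only U+0E2D (failing test 3 for "ก"), IsThaiVowel match only U+0E4D (failing test 8 for "ไ"), and IsThaiPreVowel match only U+0E44, which happens to be "ไ" (so test 9 passes). The sketch below shows the scalar-context behaviour and one way to keep the compact qw{} layout; _as_property() and the *Demo sub names are hypothetical, not part of the module above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same shape as the subs in the module above: returns a LIST.
sub IsThaiMConsListDemo {
    return qw{ 0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B 0E2D };
}

# Hypothetical helper: turn a list of hex codepoints into the
# newline-terminated string format expected from a property sub.
sub _as_property {
    return join '', map "$_\n", @_;
}

# Keeps the qw{} layout, but returns a single string.
sub IsThaiMConsDemo {
    return _as_property(qw{ 0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B 0E2D });
}

# In scalar context the list-returning sub yields only its last element.
print "scalar(list sub): ", scalar(IsThaiMConsListDemo()), "\n";

# The string-returning sub works as a user-defined property.
print "KO KAI matches\n" if "\x{0E01}" =~ /\p{IsThaiMConsDemo}/;
```

If the change still appears to have no effect under mod_perl, it may be worth confirming that the edited module file is the one actually being loaded by the running server.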
| [reply] [d/l] [select] |
I appreciate the link. I read that just a few days back while working on this, but I feel I have more fully grasped it this time. It appears to indicate that each codepoint, or range, should terminate with a newline character--something that my present code is not doing and which your revision does.
However, when I replaced my code with yours, it effected no change--the second test still fails. Is this mysterious? or can it be rationally explained?
As I said, I seem rather "gifted"....
| [reply] |
$ cat reproduce_wide_char_warn.t
#!perl
use strict;
use warnings;
use utf8;
use Test::More tests => 1;
is "🐪" =~ /\N{DROMEDARY CAMEL}/, 1,
'Test: "🐪" =~ /\N{DROMEDARY CAMEL}/';
Output:
$ prove -v reproduce_wide_char_warn.t
reproduce_wide_char_warn.t ..
1..1
ok 1 - Test: "🐪" =~ /\N{DROMEDARY CAMEL}/
Wide character in print at /home/ken/perl5/perlbrew/perls/perl-5.39.3/lib/5.39.3/Test2/Formatter/TAP.pm line 156.
ok
All tests successful.
Files=1, Tests=1, 0 wallclock secs ( 0.02 usr 0.03 sys + 0.08 cusr 0.08 csys = 0.20 CPU)
Result: PASS
Resolve warning
Code:
$ cat fix_wide_char_warn.t
#!perl
use strict;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};
use Test::More tests => 1;
is "🐪" =~ /\N{DROMEDARY CAMEL}/, 1,
'Test: "🐪" =~ /\N{DROMEDARY CAMEL}/';
Output:
$ prove -v fix_wide_char_warn.t
fix_wide_char_warn.t ..
1..1
ok 1 - Test: "🐪" =~ /\N{DROMEDARY CAMEL}/
ok
All tests successful.
Files=1, Tests=1, 0 wallclock secs ( 0.02 usr 0.03 sys + 0.08 cusr 0.09 csys = 0.22 CPU)
Result: PASS
I'm not familiar with the Thai script, so I don't think I can help you much there.
I did notice that some characters are very similar (e.g. ข & ฃ); so look closely at that sort of thing.
I'm also aware that in some Unicode scripts the glyph changes depending on context
(e.g. initial/start of word, medial/middle of word, final/end of word, isolate/by itself);
so that's maybe something to investigate.
| [reply] |
The similarity among Thai characters is one of the reasons a Thai reader who can read without stumbling is rare. The biggest reason is the lack of spaces between words (and this is one of the reasons these character classes would be so helpful, as they could help with word-splitting or syllable-splitting Thai text). I've never heard of a Thai speed reader, and I do not believe it is possible. That said, there is definitely a difference between the otherwise similar characters in usage and in pronunciation; and experienced readers will be able to guess the word without seeing the minute details. Others, like myself, prefer to have a larger font size so as to see those minutiae more clearly.
None of these issues, however, affect the script at present. I have gone over the codepoints with a virtual fine-toothed comb, rechecking all of them. I did make a couple of corrections in the process, one minor one being the removal of the unassigned codepoints from one of the code blocks. There remain some edge cases which the script does not address--perhaps in the future a Thai linguist of superior skill might suggest additional functionality.
I feel more comfortable with the Thai side of this script, at present, than with the Perl side (which simply refuses to work). It is to contribute to the Thai programming community that I do this; and I much appreciate the Perl gurus who are able to help on that side of things.
| [reply] |
Incidentally, I am no longer using Test::More (use Test::More;). I discovered that it was the source of all of my errors, including all of the "wide character" log messages, and my code is working well now without it, with zero errors being logged. Apparently, Test::More was not designed to be compatible with Unicode characters, and is therefore not fit for purpose for my script.
I had planned to use it for the module testing, as is recommended in the guides for module preparation. Now, I'm not sure what to do. How can one go about testing his own script without this module? More importantly, how does one ensure that the installation will not fail in the absence of such a testing environment?
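For what it is worth, the "Wide character" warnings can usually be silenced without dropping Test::More, either with the "use open" line shown earlier in this thread or by putting an encoding layer on the test builder's own handles, as the Test::More documentation describes. The following is a sketch only: IsDemoKoKai is a stand-in property defined for the example, and the file is assumed to be saved as UTF-8 and run as a t/*.t script under prove rather than as a CGI page.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # this source file contains literal Thai characters
use Test::More;

# Give Test::More's output handles a UTF-8 layer so that test names
# containing non-ASCII characters do not trigger "Wide character" warnings.
my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';
binmode $builder->todo_output,    ':encoding(UTF-8)';

# Stand-in user-defined property for the demo: matches only U+0E01.
sub IsDemoKoKai { return "0E01\n" }

# like()/unlike() report the string and pattern on failure, which is
# more informative than comparing a match result against 1 with is().
like   'ก', qr/\p{IsDemoKoKai}/, 'ก matches IsDemoKoKai';
unlike 'ข', qr/\p{IsDemoKoKai}/, 'ข does not match IsDemoKoKai';

done_testing;
```

Because the TAP output goes to the terminal under prove, no HTTP header is involved, which should also avoid the "did not send an HTTP header" entries seen in the Apache log above.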
I'm nearly ready to wrap up with the creation of the module but for details of this nature. Packaging for CPAN is a bit cumbersome--at least for the first time around while learning the ropes.
| [reply] [d/l] [select] |