in reply to Re: Listing out the characters included in a character class
in thread Listing out the characters included in a character class

This was the best response yet...and the voters seem to agree. Thank you.

Your InThaiHCons() and InThaiLCons() seem overcomplicated.

There are two nuances to this which you may not have grasped: 1) The double-column codepoints in the 'InThaiLCons' indicate ranges, i.e. the '0E04 0E07' line will actually return '0E04 0E05 0E06 0E07'; and 2) I have formatted the 'InThaiHCons' as I have in order to be able to indicate in the markup what the codepoints represent. It's hard to look at a codepoint and just remember which character it is for, and as the code maintainer, this association helps me tremendously, especially for certain characters. However, I am considering removing those comments for the sake of code brevity and tidiness before releasing the module to CPAN, which I fully intend to do soon, having delayed years already in doing so due to my own lack of confidence (this will be a first for me).
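
As a minimal sketch of the range form described above (the property name InDemoThaiLCons and its contents are made up for illustration, not taken from the module): each line returned by a user-defined property sub is either one hex codepoint or two hex codepoints separated by a tab, and the two-column form is treated as an inclusive range.

```perl
use strict;
use warnings;

# Hypothetical demo property; user-defined property names must start
# with "In" or "Is". Each line is "HEX" or "START<tab>END".
sub InDemoThaiLCons {
    return "0E04\t0E07\n"     # range: U+0E04 through U+0E07
         . "0E0A\t0E0D\n"     # range: U+0E0A through U+0E0D
         . "0E11\n";          # single codepoint
}

# Collect every codepoint in U+0E00..U+0E7F that the property matches.
my @matched = grep { chr($_) =~ /\p{InDemoThaiLCons}/ } 0x0E00 .. 0x0E7F;

printf "U+%04X\n", $_ for @matched;
```

With these three lines the loop should report nine codepoints: four from each range plus U+0E11.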

That said, in my quest for methods to do what I want done, I discovered that the subroutines can be called in the code in a different context than that of a regular expression, and they will, themselves, return the codepoints I desire. However, they do not preserve the double-columnness demonstrated by the 'InThaiLCons' of my example, simply putting all the codepoints in a straight list--so I have decided not to use those ranges, despite their obvious efficiency, and just list every single codepoint. This solves a couple problems at once, with only the problem of increasing the visible size of the lists (i.e. more code). So, my new 'InThaiLCons' would look like this:

sub InThaiLCons {
    return join "\n",
        '0E04', '0E05', '0E06', '0E07',
        '0E0A', '0E0B', '0E0C', '0E0D',
        '0E11', '0E12', '0E13',
        '0E17', '0E18', '0E19',
        '0E1E', '0E1F', '0E20', '0E21', '0E22',
        '0E23', '0E24', '0E25', '0E26', '0E27',
        '0E2C', '0E2E',
}

However, after your suggestions, that can be more efficiently represented as:

sub InThaiLCons {
    return [qw{
        0E04 0E05 0E06 0E07
        0E0A 0E0B 0E0C 0E0D
        0E11 0E12 0E13
        0E17 0E18 0E19
        0E1E 0E1F 0E20 0E21 0E22
        0E23 0E24 0E25 0E26 0E27
        0E2C 0E2E
    }]
}
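
On calling the subroutines directly and the range form: a direct call returns the same text the regex engine would read, so a small helper could expand two-column lines into a flat list. This is only a sketch. The names InDemoThaiHCons and expand_property are hypothetical, and the helper assumes plain "HEX" or "HEX HEX" lines with none of the +, -, ! or & prefixes that property definitions also allow.

```perl
use strict;
use warnings;

# Hypothetical property sub that keeps the two-column range form.
sub InDemoThaiHCons {
    return "0E02\t0E03\n" . "0E09\n" . "0E10\n" . "0E16\n";
}

# Expand the text returned by a property sub into codepoint numbers.
sub expand_property {
    my ($text) = @_;
    my @codepoints;
    for my $line (split /\n/, $text) {
        my ($start, $end) = split ' ', $line;    # one or two hex columns
        next unless defined $start;              # skip empty lines
        $end = $start unless defined $end;       # single codepoint
        push @codepoints, hex($start) .. hex($end);
    }
    return @codepoints;
}

my @codepoints = expand_property(InDemoThaiHCons());
print join(' ', map { sprintf '%04X', $_ } @codepoints), "\n";
# prints "0E02 0E03 0E09 0E10 0E16"
```
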
I have a new problem, in that I want to use two names for each of these subroutines: i.e. 'InThai...' and 'IsThai...'. Essentially, they appear to be synonymous for many current usages, and I wish for either of these forms to be acceptable with this new functionality as well. So, must I repeat the entire subroutine in the code, changing only its name? or is there a way to alias it to another name?

Regarding the use of <pre> tags, are they equivalent to the <code> tags? I had put the UTF8 characters in a <code> block, and they got converted to ugly HTML-entities. That's why I moved them to outside of that block.

Incidentally, there will indeed also be an 'InThaiMCons' definition in this module (and more)!

Blessings,

~Polyglot~

Replies are listed 'Best First'.
Re^3: Listing out the characters included in a character class
by hippo (Archbishop) on Oct 28, 2023 at 12:54 UTC
    So, must I repeat the entire subroutine in the code, changing only its name? or is there a way to alias it to another name?

    There are two ways to achieve this without duplicating the code inside the subroutine, viz.:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    sub hello {
        print "Hello $_[0]!\n";
    }

    *hi = \&hello;
    sub bonjour { &hello }

    hello ('there');
    hi ('world');
    bonjour ('Alain');

    It's usually cleaner and clearer just to have the one name for any given piece of functionality, however.
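
Applied to a user-defined property, the glob-assignment approach might look like the sketch below. The names InDemoThaiMCons and IsDemoThaiMCons and the codepoints are invented for the example, and the BEGIN block is a precaution of mine (so the alias exists before any pattern is compiled) rather than something the thread established as necessary.

```perl
use strict;
use warnings;

# Hypothetical demo property sub.
sub InDemoThaiMCons {
    return "0E01\n" . "0E08\n" . "0E0E\t0E0F\n";
}

# Alias the "Is" spelling to the same code; nothing is duplicated.
BEGIN { *IsDemoThaiMCons = \&InDemoThaiMCons; }

my $char = chr 0x0E01;    # THAI CHARACTER KO KAI

print "In form matches\n" if $char =~ /\p{InDemoThaiMCons}/;
print "Is form matches\n" if $char =~ /\p{IsDemoThaiMCons}/;
print "same definition\n" if IsDemoThaiMCons() eq InDemoThaiMCons();
```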


    🦛

      It's usually cleaner and clearer just to have the one name for any given piece of functionality, however.

      I agree with this; however, others before me have already given us all such synonyms as "InThai" and "IsThai". That being the case, others coming along may not know which form to use. Sigh. To my mind, "InThai" looks to represent a range, and "IsThai" represents a quality--but these do happen to both apply to the same codepoints in this case. The same is true, however, for all of my Thai character groupings--essentially anytime more than one character is involved. But because of this overlap, and because it boils down to mere semantics and what people will remember/opine/prefer, I think it best to create the secondary names across the board, for flexibility/compatibility, even for single-codepoint returns.

      The Perl documents are poor in this respect, and do not clarify the distinctions among \p{Thai}, \p{InThai}, \p{IsThai}. An explanation is offered at this URL: https://www.regular-expressions.info/unicode.html, saying:

      Not all Unicode regex engines use the same syntax to match Unicode blocks. Java, Ruby 2.0, and XRegExp use the \p{InBlock} syntax as listed above. .NET and XML use \p{IsBlock} instead. Perl and the JGsoft flavor support both notations. I recommend you use the “In” notation if your regex engine supports it. “In” can only be used for Unicode blocks, while “Is” can also be used for Unicode properties and scripts, depending on the regular expression flavor you’re using. By using “In”, it’s obvious you’re matching a block and not a similarly named property or script.
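
For Perl's built-in properties, the block/script distinction can be checked directly. The sketch below assumes a reasonably recent Perl and that no user-defined InThai or IsThai subs are in scope. It uses U+0E3F (THAI CURRENCY SYMBOL BAHT), which lies in the Thai block but whose script is Common rather than Thai.

```perl
use strict;
use warnings;

my $ko_kai = chr 0x0E01;    # THAI CHARACTER KO KAI
my $baht   = chr 0x0E3F;    # THAI CURRENCY SYMBOL BAHT

# \p{InThai} tests the block; \p{IsThai} and \p{Thai} test the script.
for my $pair ([ 'KO KAI', $ko_kai ], [ 'BAHT', $baht ]) {
    my ($name, $char) = @$pair;
    printf "%-6s  InThai=%-3s  IsThai=%-3s  Thai=%-3s\n",
        $name,
        ($char =~ /\p{InThai}/ ? 'yes' : 'no'),
        ($char =~ /\p{IsThai}/ ? 'yes' : 'no'),
        ($char =~ /\p{Thai}/   ? 'yes' : 'no');
}
```

So for the built-in names at least, "In" and "Is" are not quite interchangeable: the block also covers symbols and unassigned codepoints that the script property leaves out.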

      Blessings,

      ~Polyglot~

Re^3: Listing out the characters included in a character class
by kcott (Archbishop) on Oct 28, 2023 at 15:40 UTC

    You can still keep ranges. There are better ways to represent them; see code below.

    You can represent Unicode names against individual codepoints; it will become somewhat difficult and possibly messy for ranges of codepoints. I recommend that you have Unicode PDF Character Code Chart "Thai -- Range: 0E00-0E7F" at hand when developing; this sequentially lists the codepoints, their glyphs, their names, and some entries have additional notes. You might consider adding that link to your module's POD. If you're writing code for other (Unicode) scripts, you can find links to all of the current charts at "Unicode 15.1 Character Code Charts".

    Having multiple names for the same subroutine is often confusing and generally, in my opinion, a design flaw; however, it's easily achieved with additional keys in the despatch table. I would urge you to reconsider if that's something you really need.

    Update: I've just posted and saw your reply to hippo. Given your explanation, use of multiple names seems valid in this instance.

    New script and Module:

    ken@titan ~/tmp/pm_11155205_uni_char_class
    $ ls -l *2*
    -rw-r--r-- 1 ken None 1275 Oct 29 01:50 PolyUniCharClass2.pm
    -rwxr-xr-x 1 ken None  370 Oct 29 01:42 uni_char_class_2.pl

    uni_char_class_2.pl:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use open OUT => qw{:encoding(UTF-8) :std};

    use lib '.';    # DEMO ONLY -- DON'T use in PRODUCTION!
    use PolyUniCharClass2;

    for my $prefix (qw{In Is If}) {
        for my $class (qw{H L M}) {
            my $cons = "${prefix}Thai${class}Cons";
            print "$cons:\n";
            print @{PolyUniCharClass2::list($cons)}, "\n";
        }
    }

    PolyUniCharClass2.pm:

    package PolyUniCharClass2;

    use strict;
    use warnings;

    {
        my %char_class_despatch = (
            InThaiHCons => \&InThaiHCons,
            InThaiLCons => \&InThaiLCons,
            IsThaiHCons => \&InThaiHCons,
            IsThaiLCons => \&InThaiLCons,
        );

        sub list {
            my ($char_class) = @_;

            unless (exists $char_class_despatch{$char_class}) {
                warn "Char class '$char_class' doesn't exist!\n";
                return [];
            }

            return [map chr, @{$char_class_despatch{$char_class}->()}];
        }
    }

    {
        my $ThaiHCons = [qw{0E02-0E03 0E09 0E10 0E16}];
        my $ThaiLCons = [qw{0E04-0E07 0E0A-0E0D 0E11}];
        my %ThaiCons_expanded;

        sub InThaiHCons {
            return $ThaiCons_expanded{InThaiHCons} ||= _expand($ThaiHCons);
        }

        sub InThaiLCons {
            return $ThaiCons_expanded{InThaiLCons} ||= _expand($ThaiLCons);
        }
    }

    {
        my $re = qr{^([0-9A-Fa-f]+)-([0-9A-Fa-f]+)$};

        sub _expand {
            my ($code_range_list) = @_;

            my @full_list;

            for my $range (@$code_range_list) {
                if ($range =~ $re) {
                    push @full_list, hex($1) .. hex($2);
                }
                else {
                    push @full_list, hex $range;
                }
            }

            return [@full_list];
        }
    }

    1;

    Output:

    $ ./uni_char_class_2.pl
    InThaiHCons:
    ขฃฉฐถ
    InThaiLCons:
    คฅฆงชซฌญฑ
    InThaiMCons:
    Char class 'InThaiMCons' doesn't exist!
    
    IsThaiHCons:
    ขฃฉฐถ
    IsThaiLCons:
    คฅฆงชซฌญฑ
    IsThaiMCons:
    Char class 'IsThaiMCons' doesn't exist!
    
    IfThaiHCons:
    Char class 'IfThaiHCons' doesn't exist!
    
    IfThaiLCons:
    Char class 'IfThaiLCons' doesn't exist!
    
    IfThaiMCons:
    Char class 'IfThaiMCons' doesn't exist!
    

    There are a number of improvements you could make to the module code depending on the Perl version you're targeting. You didn't indicate your Perl version. The code I've presented should, I believe, work fine with Perl 5.6 (but I have no way to check that).

    — Ken

Re^3: Listing out the characters included in a character class [v5.38]
by kcott (Archbishop) on Oct 28, 2023 at 19:42 UTC

    In my last response, I believe I covered all of the coding issues. I finished with:

    "There are a number of improvements you could make to the module code depending on the Perl version you're targeting. ... The code I've presented should, I believe, work fine with Perl 5.6 ..."

    Perl does a great job of keeping up with Unicode versions. The latest Unicode version is 15.1; Perl v5.38 (the latest stable version) supports Unicode 15.0 (see "perl5380delta: Unicode 15.0 is supported"). Writing your code for Perl 5.6 may be insufficient to handle the Unicode support you need; look through the deltas to find the minimum Perl version for your needs.

    Partly because it was a fun task for me, but also to show you some of the improvements you could get from a later version, here's the code rewritten for Perl v5.38 and Unicode 15.0.

    New script and module:

    ken@titan ~/tmp/pm_11155205_uni_char_class
    $ ls -l *3*
    -rw-r--r-- 1 ken None 993 Oct 29 05:03 PolyUniCharClass3.pm
    -rwxr-xr-x 1 ken None 344 Oct 29 05:03 uni_char_class_3.pl

    uni_char_class_3.pl:

    #!/usr/bin/env perl

    use v5.38;
    use open OUT => qw{:encoding(UTF-8) :std};

    use lib '.';    # DEMO ONLY -- DON'T use in PRODUCTION!
    use PolyUniCharClass3;

    for my $prefix (qw{In Is If}) {
        for my $class (qw{H L M}) {
            my $cons = "${prefix}Thai${class}Cons";
            say "$cons:";
            say PolyUniCharClass3::list($cons)->@*;
        }
    }

    PolyUniCharClass3.pm:

    package PolyUniCharClass3;

    use v5.38;

    sub list ($char_class) {
        state $valid_char_class = {map +($_ => 1), qw{
            InThaiHCons IsThaiHCons
            InThaiLCons IsThaiLCons
        }};

        unless (exists $valid_char_class->{$char_class}) {
            warn "Char class '$char_class' doesn't exist!\n";
            return [];
        }

        return [map chr, ThaiCons(substr $char_class, 2)->@*];
    }

    sub ThaiCons ($cons) {
        state $code_ranges = {
            ThaiHCons => [qw{0E02-0E03 0E09 0E10 0E16}],
            ThaiLCons => [qw{0E04-0E07 0E0A-0E0D 0E11}],
        };
        state $ThaiCons_expanded;

        return $ThaiCons_expanded->{$cons} //= _expand($code_ranges->{$cons});
    }

    sub _expand ($code_range_list) {
        state $re = qr{^([0-9A-Fa-f]+)-([0-9A-Fa-f]+)$};

        my @full_list;

        for my $range ($code_range_list->@*) {
            if ($range =~ $re) {
                push @full_list, hex($1) .. hex($2);
            }
            else {
                push @full_list, hex $range;
            }
        }

        return [@full_list];
    }

    Output (unchanged):

    ken@titan ~/tmp/pm_11155205_uni_char_class
    $ ./uni_char_class_3.pl
    InThaiHCons:
    ขฃฉฐถ
    InThaiLCons:
    คฅฆงชซฌญฑ
    InThaiMCons:
    Char class 'InThaiMCons' doesn't exist!
    
    IsThaiHCons:
    ขฃฉฐถ
    IsThaiLCons:
    คฅฆงชซฌญฑ
    IsThaiMCons:
    Char class 'IsThaiMCons' doesn't exist!
    
    IfThaiHCons:
    Char class 'IfThaiHCons' doesn't exist!
    
    IfThaiLCons:
    Char class 'IfThaiLCons' doesn't exist!
    
    IfThaiMCons:
    Char class 'IfThaiMCons' doesn't exist!
    

    There were a couple of points at the end of your post which I didn't address. Here goes:

    "Regarding the use of <pre> tags, are they equivalent to the <code> tags?"

    They do much the same job, but <pre> tags differ in these ways:

    • Unicode characters will be rendered instead of the entity references you get with <code> tags. After previewing, you may see entity references in the textarea where you're typing, but the preview itself should show the characters (assuming you have appropriate fonts to display them).
    • You won't get a [download] link at the end of the block.
    • You won't get code wrapping: <code> tags will break long lines, starting wrapped lines with a prominent + (by default, it's red). Because of this, aim to keep lines short.
    • You'll need to handle special characters yourself; e.g. writing &#91; instead of [. See "site how to: Submitting Code and Escaping Characters" for more about that.

    "Incidentally, there will indeed also be an 'InThaiMCons' definition in this module (and more)!"

    I picked names like If* and *M* for my testing. Your test suite (t/*.t scripts) should check that both success and failure are handled appropriately.
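
A test along those lines might look like the following sketch, using the core Test::More module. To keep it runnable on its own, it inlines a hypothetical stand-in for list() with made-up data; a real t/*.t script would load the module instead.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical stand-in for the module's list() function.
my %demo_classes = (
    InThaiHCons => [0x0E02, 0x0E03, 0x0E09, 0x0E10, 0x0E16],
);

sub list {
    my ($char_class) = @_;
    unless (exists $demo_classes{$char_class}) {
        warn "Char class '$char_class' doesn't exist!\n";
        return [];
    }
    return [map { chr($_) } @{ $demo_classes{$char_class} }];
}

# Success: a known class returns the expected number of characters.
is scalar @{ list('InThaiHCons') }, 5, 'InThaiHCons lists five characters';

# Failure: an unknown class warns and returns an empty arrayref.
{
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    is_deeply list('IfThaiHCons'), [], 'unknown class returns empty arrayref';
    is scalar @warnings, 1, 'exactly one warning was issued';
    like $warnings[0], qr/doesn't exist/, 'warning explains the problem';
}

done_testing;
```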

    — Ken

      Well, I've nearly finished polishing up the module itself--still some work to do on the testing script, but it is at least functional. The module, however, is not working properly on my machine, and produces failure messages in the logs. I have put the full code, as I intend soon to publish it anyhow, on my scratchpad: Polyglot's scratchpad.

      The errors I'm getting look like this:

      [Mon Oct 30 05:11:03.311339 2023] [core:error] [pid 188075:tid 139660223264320] [client 192.168.1.101:53954] Premature end of script headers: test-thai-mod.pl
      [Mon Oct 30 05:11:03.311358 2023] [perl:warn] [pid 188075:tid 139660223264320] /cgi/test-thai-mod.pl did not send an HTTP header
      [Mon Oct 30 05:11:03.311388 2023] [:error] [pid 188075:tid 139660223264320] Undefined subroutine &ModPerl::ROOT::ModPerl::PerlRun::var_www_cgi_test_2dthai_2dmod_2epl::IsThaiLCons called at /var/www/cgi/test-thai-mod.pl line 24.\n

      The "did not send an HTTP header" has nothing to do with the header, but with premature exiting of code execution due to other problems. The "Undefined subroutine" seems to be the issue, and I have no clue why. Once, with a similar error message, I restarted the apache2 server and all was well. But that no longer works on this new message. I am left not knowing whether my apache2 server is at issue, or whether it is this code--but probably the latter. There's certainly no point trying to publish code that is not first functional, so any help on this would be much appreciated.

      Blessings,

      ~Polyglot~

        "... any help on this would be much appreciated."

        That's nigh impossible without seeing /var/www/cgi/test-thai-mod.pl.

        ModPerl::ROOT::ModPerl::PerlRun::var_www_cgi_test_2dthai_2dmod_2epl looks like a very unusual module name. Perhaps a typo on "/var/www/cgi/test-thai-mod.pl line 24".
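
One possible reading, offered as an assumption since the script has not been posted: ModPerl::PerlRun compiles each script into a generated package named after its path, so an unqualified call such as IsThaiLCons() is looked up in that package and fails unless the sub was imported there. The sketch below imitates the effect with two made-up packages (Demo::Thai and Demo::Generated::Script); it does not use mod_perl itself.

```perl
use strict;
use warnings;

# Hypothetical module package holding the property sub.
package Demo::Thai;
sub IsThaiLCons { return "0E04\t0E07\n" }

# Stand-in for the package generated around a script.
package Demo::Generated::Script;

# Unqualified call: resolved in the current package, where no such sub exists.
my $ok    = eval { IsThaiLCons(); 1 };
my $error = $ok ? '' : $@;
print "unqualified call failed: $error" unless $ok;

# Fully qualified call: found regardless of the calling package.
print "qualified call returned: ", Demo::Thai::IsThaiLCons();
```

If that is what is happening, either exporting the sub from the module or calling it by its fully qualified name should make the error go away.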

        When you post test-thai-mod.pl, the reason for showing Regexp::CharClasses::Thai on your scratchpad may become apparent: without further information, it seems irrelevant.

        — Ken

Re^3: Listing out the characters included in a character class
by Polyglot (Chaplain) on Nov 29, 2023 at 17:03 UTC
    By the way, I discovered that the shortened form of the subroutine did not work, and have reverted to my previous format for returning the codepoints. This is the sort of error I got from it...
    Can't find Unicode property definition "ARRAY(0x55ff865ace80)" in expansion of main::IsThai
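
The message is consistent with the way user-defined properties work: the sub has to return a string of newline-separated lines, so an array reference apparently ends up stringified as "ARRAY(0x...)" and looked up as a property name. A compact form that still returns a string might be sketched as follows (IsDemoThaiLCons is a hypothetical name; the codepoints are the ones listed earlier in the thread).

```perl
use strict;
use warnings;

# qw{} keeps the list short to type; join/map turn it back into the
# newline-terminated lines that a property sub is expected to return.
sub IsDemoThaiLCons {
    return join '', map { $_ . "\n" } qw{
        0E04 0E05 0E06 0E07 0E0A 0E0B 0E0C 0E0D
        0E11 0E12 0E13 0E17 0E18 0E19
        0E1E 0E1F 0E20 0E21 0E22 0E23 0E24 0E25 0E26 0E27
        0E2C 0E2E
    };
}

my $kho_khwai = chr 0x0E04;    # THAI CHARACTER KHO KHWAI
print "matched\n" if $kho_khwai =~ /\p{IsDemoThaiLCons}/;
```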

    Blessings,

    ~Polyglot~