in reply to Listing out the characters included in a character class

Presently, after much work, and after cutting out much of the code to make it shorter (per Hippo's sage advice), the script runs, but still logs some errors. I've narrowed the problem down, but am still mystified as to the reason for it. I've even done a full server restart, and yet the errors persist. Here is the script's output to the browser (using "pre" tags for proper UTF8 rendering):
Content-Language: utf8;

Checking the Thai module

ok 1 - use RegexpCharClassesThai;
Positives...
ok 2 - Match for "ก" =~ /\p{IsKokai}/
not ok 3 - Match for "ก" =~ /\p{InThaiMCons}/
Negatives...
ok 4 - No match for "ก" =~ /\p{InThaiHCons}/
ok 5 - No match for "ก" =~ /\p{InThaiLCons}/
ok 6 - No match for "ก" =~ /\p{InThaiVowel}/
ok 7 - No match for "ก" =~ /\p{InThaiPreVowel}/
Positives...
not ok 8 - Match for "ไ" =~ /\p{InThaiVowel}/
ok 9 - Match for "ไ" =~ /\p{InThaiPreVowel}/
Negatives...
ok 10 - No match for "ไ" =~ /\p{InThaiHCons}/
ok 11 - No match for "ไ" =~ /\p{InThaiMCons}/
ok 12 - No match for "ไ" =~ /\p{InThaiLCons}/

Check:

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

INC: /var/www/lib/ /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.34.0 /usr/local/share/perl/5.34.0 /usr/lib/x86_64-linux-gnu/perl5/5.34 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.34 /usr/share/perl/5.34 /usr/local/lib/site_perl /etc/apache2


Here is the output to the log file:
[Wed Nov 01 06:37:06.123037 2023] [core:error] [pid 754:tid 139702641829440] [client 192.168.1.101:58127] Premature end of script headers: test-thai-mod.pl
[Wed Nov 01 06:37:06.123072 2023] [perl:warn] [pid 754:tid 139702641829440] /cgi/test-thai-mod.pl did not send an HTTP header
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ก" =~ /\p{InThaiMCons}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 32.
# got: ''
# expected: '1'
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ไ" =~ /\p{InThaiVowel}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 43.
# got: ''
# expected: '1'
[Wed Nov 01 06:38:29.720987 2023] [core:error] [pid 753:tid 139702641829440] [client 192.168.1.101:58138] Premature end of script headers: test-thai-mod.pl
[Wed Nov 01 06:38:29.721022 2023] [perl:warn] [pid 753:tid 139702641829440] /cgi/test-thai-mod.pl did not send an HTTP header
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ก" =~ /\p{InThaiMCons}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 32.
# got: ''
# expected: '1'
Wide character in print at /usr/share/perl/5.34/Test2/Formatter/TAP.pm line 125.
# Failed test ' Match for "ไ" =~ /\p{InThaiVowel}/<br>'
# at /var/www/cgi/test-thai-mod.pl line 43.
# got: ''
# expected: '1'

And here is the MODULE code:
package RegexpCharClassesThai;

use 5.008003;
use strict;
use warnings;
use utf8;
use Exporter;

our @ISA = qw(Exporter);

our %EXPORT_TAGS = (

	classes =>
		[ qw(InThaiHCons InThaiMCons InThaiLCons 
		InThaiVowel InThaiPreVowel
		IsThaiHCons IsThaiMCons IsThaiLCons 
		IsThaiVowel IsThaiPreVowel) ],
  
	characters =>
		[ qw(InKokai InKhokhai 
		IsKokai IsKhokhai ) ],  

);
# add all the other ":class" tags to the ":all" class,
# deleting duplicates
 {
   my %seen;
   push @{$EXPORT_TAGS{all}},
     grep {!$seen{$_}++} @{$EXPORT_TAGS{$_}} foreach keys %EXPORT_TAGS;
 }

our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );

our @EXPORT = ( @{ $EXPORT_TAGS{'classes'} } );

our $VERSION = '1.00';

#--------------------------------------------------------------
#	CREATE FUNCTIONALITY FOR SHOWING CONTENTS OF EACH CLASS
#--------------------------------------------------------------

my %char_class_dispatch = (
		InThaiHCons 	=> \&InThaiHCons,
		InThaiMCons 	=> \&InThaiMCons,
		InThaiLCons 	=> \&InThaiLCons,
		InThaiVowel	=> \&InThaiVowel,
		InThaiPreVowel	=> \&InThaiPreVowel,
    );

    sub list {
        my ($char_class) = @_;

        unless (exists $char_class_dispatch{$char_class}) {
            warn "Char class '$char_class' doesn't exist!\n";
            return [];
        }

        return [
            map chr hex, @{$char_class_dispatch{$char_class}->()}
        ];
    }
    
#--------------------------------------------------------------
#	Start with the "Is..." versions
#--------------------------------------------------------------

sub IsThaiHCons { #THAI HIGH-CLASS CONSONANTS
# ข ฃ ฉ ฐ ถ ผ ฝ ศ ษ ส ห
    return qw{
 0E02 0E03 0E09 0E10 0E16 0E1C 0E1D 0E28 
 0E29 0E2A 0E2B
 }
}

sub IsThaiMCons { #THAI MID-CLASS CONSONANTS
# ก จ ฎ ฏ ด ต บ ป อ
    return qw{
 0E01 0E08 0E0E 0E0F 0E14 0E15 0E1A 0E1B 
 0E2D 
 }
}

sub IsThaiLCons { #THAI LOW-CLASS CONSONANTS
# ค ฅ ฆ ง ช ซ ฌ ญ ฑ ฒ ณ ท ธ น พ ฟ ภ ม ย ร ฤ ล ฦ ว ฬ ฮ 
    return qw{
 0E04 0E05 0E06 0E07 0E0A 0E0B 0E0C 0E0D 
 0E11 0E12 0E13 0E17 0E18 0E19 0E1E 0E1F 
 0E20 0E21 0E22 0E23 0E24 0E25 0E26 0E27 
 0E2C 0E2E 
 }
}

sub IsThaiVowel { #THAI VOWELS
#NOTE: 0E4D combines with a consonant but may not be considered a vowel
# ย ฤ ฦ ว อ ะ ั า ํา ิ ี ึ ื ุ ู ฺ เ แ โ ใ ไ ๅ ็ ํ
    return qw{
 0E22 0E24 0E26 0E27 0E2D 0E30 0E31 0E32 
 0E33 0E34 0E35 0E36 0E37 0E38 0E39 0E3A 
 0E40 0E41 0E42 0E43 0E44 0E45 0E47 0E4D 
 }
}

sub IsThaiPreVowel { #VOWELS PRECEDING CONSONANT
# เ แ โ ใ ไ
    return qw{
 0E40 0E41 0E42 0E43 0E44 
 }
}

#--------------------------------------------------------------
#	Alias the "In..." forms (same as above)
#--------------------------------------------------------------

		sub InThaiHCons 	{ &IsThaiHCons 		}; 	
		sub InThaiMCons 	{ &IsThaiMCons 		}; 	
		sub InThaiLCons 	{ &IsThaiLCons 		}; 	
		sub InThaiVowel 	{ &IsThaiVowel 		};		
		sub InThaiPreVowel 	{ &IsThaiPreVowel 	};	


#--------------------------------------------------------------
#	Provide spelled-out forms of the individual characters
#--------------------------------------------------------------

sub IsKokai 		{ return '0E01' } 	# ก - THAI CHARACTER KO KAI
sub IsKhokhai 		{ return '0E02' } 	# ข - THAI CHARACTER KHO KHAI

#--------------------------------------------------------------
#	Alias the spelled-out individual characters
#--------------------------------------------------------------

sub InKokai 		{ &IsKokai 		}
sub InKhokhai 		{ &IsKhokhai 		}

1;

__END__

=pod

=encoding utf8

=head1 DESCRIPTION

This module supplements the UTF-8 character-class definitions 
available to regular expressions (regex) with special groups 
relevant to Thai linguistics.  The following classes are defined:

	โมดูลนี้เป็นส่วนเสริมคำจำกัดความคลาสอักขระ UTF-8
	ใช้ได้กับนิพจน์ทั่วไป (regex) ด้วยกลุ่มพิเศษ
	ที่เกี่ยวข้องกับภาษาศาสตร์ไทย มีการกำหนดคลาสต่อไปนี้:


=over 4

=item InThaiVowel / IsThaiVowel

Matches Thai vowels only, including compounded and free-standing vowels.
Exceptions here include several of the "consonants" which also serve as
vowels: or-ang, yo-yak, double ro-reua, leut and reut, and wo-wen.

NOTE: Thai vowels cannot stand alone: they are always connected with a
consonant.  Many of these, without their consonant companions, will appear
with the unicode dotted-circle character (U+25CC) when rendered, showing
a character is missing.  Conversely, Thai consonants can exist without a
vowel, and some Thai words do not have written vowels (the vowel is implied).

=item InThaiPreVowel / IsThaiPreVowel

Matches only the subset of vowels which appear _before_ the consonant 
with which they are associated (though in Thai they are sounded _after_ 
said consonant); this excludes all consonant-vowels and does not include 
any of the compounded vowels.

=item InThaiHCons / IsThaiHCons

Matches Thai high-class consonants.

=item InThaiMCons / IsThaiMCons

Matches Thai middle-class consonants.

=item InThaiLCons / IsThaiLCons

Matches Thai low-class consonants.

=back

=cut

And here is the SCRIPT code:
#!/usr/bin/perl

#TEST THAI MODULE

use strict;
use warnings;
use lib '/var/www/lib/';
use RegexpCharClassesThai;  
use RegexpCharClassesThai qw( :all );  
use utf8;
use Test::More;
binmode STDERR, ":utf8";
binmode STDIN,  ":utf8";
binmode STDOUT, ":utf8";

BEGIN {
	print "Content-Type:text/html; charset=utf-8\n";
	print "Content-Language: utf8;\n\n";
}

print <<PAGE;
<html lang="utf8">
<body>

<h3>Checking the Thai module</h3>
PAGE

use_ok('RegexpCharClassesThai');

print "<h5>Positives...</h5>";
is( 'ก' =~ /[\p{IsKokai}]/,1,' Match for  "ก" =~ /\p{IsKokai}/<br>');
is( 'ก' =~ /\p{InThaiMCons}/,1,' Match for  "ก" =~ /\p{InThaiMCons}/<br>');
#PRODUCES ERROR, STOPPING CODE EXECUTION
#is( 'ก' =~ /\p{InThaiNonexistent}/,1,' Match for  "ก" =~ /\p{InThaiFinCons}/<br>');

print "<h5>Negatives...</h5>";
isnt( 'ก' =~ /\p{InThaiHCons}/,1,' No match for  "ก" =~ /\p{InThaiHCons}/<br>');
isnt( 'ก' =~ /\p{InThaiLCons}/,1,' No match for  "ก" =~ /\p{InThaiLCons}/<br>');
isnt( 'ก' =~ /\p{InThaiVowel}/,1,' No match for  "ก" =~ /\p{InThaiVowel}/<br>');
isnt( 'ก' =~ /\p{InThaiPreVowel}/,1,' No match for  "ก" =~ /\p{InThaiPreVowel}/<br>');
 
print "<h5>Positives...</h5>";
is( 'ไ' =~ /\p{InThaiVowel}/,1,' Match for  "ไ" =~ /\p{InThaiVowel}/<br>');
is( 'ไ' =~ /\p{InThaiPreVowel}/,1,' Match for  "ไ" =~ /\p{InThaiPreVowel}/<br>');

print "<h5>Negatives...</h5>";
isnt( 'ไ' =~ /\p{InThaiHCons}/,1,' No match for  "ไ" =~ /\p{InThaiHCons}/<br>');
isnt( 'ไ' =~ /\p{InThaiMCons}/,1,' No match for  "ไ" =~ /\p{InThaiMCons}/<br>');
isnt( 'ไ' =~ /\p{InThaiLCons}/,1,' No match for  "ไ" =~ /\p{InThaiLCons}/<br>');

print <<PAGE;

<h3>Check:</h3>
<p>PATH: $ENV{PATH}</p>
<p>INC: @INC</p>
</body>
</html>

PAGE

As the script output to the browser indicates, there is a problem with the "Positives": one works, the other does not in both the consonant and the vowel cases. If the module were not properly read ("used"), the errors would stop code execution. But the module is being read, and, to my eye, the subroutines of the working and non-working rules both follow the same style. I have no idea what more could be done to fix the ones that are not working.

All this just goes to show how truly "gifted" I am...the system is always "gifting" me with problems that no one else seems to be privileged to experience! (Now, perhaps some eagle-eyed coder will embarrass me by pointing out the most obvious of flaws...ha! And yet I should be most glad of it!)

Note that in this post everything is copy/pasted from the original (already trimmed) sources, with the only alterations being those required to format it for proper display here. In other words, if these scripts run on your server, then my server may have some issues. If, however, the problem is in the code itself, your server should reflect the same issues I'm seeing. (Encoding of the UTF8 characters may be an issue in proper transfer, however, as this site converts them to HTML-entities--why can't perlmonks.org be more up-to-date with encodings? /gripe.)

Blessings,

~Polyglot~

Re^2: Listing out the characters included in a character class
by ikegami (Patriarch) on Nov 01, 2023 at 08:00 UTC

    Your subs are expected to return a string.

    For example, the second test passes with the following definition for IsThaiMCons:

    sub IsThaiMCons {
        return <<'.';
    0E01
    0E08
    0E0E
    0E0F
    0E14
    0E15
    0E1A
    0E1B
    0E2D
    .
    }
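The pattern of passes and failures in the original post looks consistent with each property sub being called in scalar context, where a `qw` list yields only its last element. That reading is an assumption, not something confirmed in this thread. A minimal sketch of the difference, using the hypothetical sub names `InDemoList` and `InDemoString` (neither is part of the module):

```perl
use strict;
use warnings;

# Hypothetical property subs, for illustration only (not from the module).

# Returns a list: in scalar context only the last element survives.
sub InDemoList { return qw{ 0E01 0E08 0E2D } }

# Returns a single string with one hex code point per line,
# which is the format perlunicode documents for user-defined properties.
sub InDemoString { return join '', map { "$_\n" } qw{ 0E01 0E08 0E2D } }

my $ko_kai = "\x{0E01}";    # THAI CHARACTER KO KAI
my $o_ang  = "\x{0E2D}";    # THAI CHARACTER O ANG

# Calling the list-returning sub in scalar context keeps only '0E2D'.
my $list_last = InDemoList();
print "InDemoList in scalar context returns: $list_last\n";

# Compare how the two definitions behave for the first and last code point.
printf "InDemoList   matches U+0E01: %s\n", $ko_kai =~ /\p{InDemoList}/   ? 'yes' : 'no';
printf "InDemoList   matches U+0E2D: %s\n", $o_ang  =~ /\p{InDemoList}/   ? 'yes' : 'no';
printf "InDemoString matches U+0E01: %s\n", $ko_kai =~ /\p{InDemoString}/ ? 'yes' : 'no';
printf "InDemoString matches U+0E2D: %s\n", $o_ang  =~ /\p{InDemoString}/ ? 'yes' : 'no';
```

If the assumption holds, it would also explain why the single-character IsKokai sub and the InThaiPreVowel test passed: in both cases the tested character happens to be the last (or only) code point in the list.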
      I appreciate the link. I read that just a few days back while working on this, but I feel I have more fully grasped it this time. It appears to indicate that each codepoint, or range, should terminate with a newline character--something that my present code is not doing and which your revision does.

      However, when I replaced my code with yours, it effected no change--the second test still fails. Is this mysterious? or can it be rationally explained?

      As I said, I seem rather "gifted"....

      Blessings,

      ~Polyglot~

Re^2: Listing out the characters included in a character class [wide character warning]
by kcott (Archbishop) on Nov 01, 2023 at 10:06 UTC

    There are a number of ways to deal with "Wide character" warnings; I'd generally use the open pragma.

    Simulate warning

    Code:

    $ cat reproduce_wide_char_warn.t
    #!perl
    
    use strict;
    use warnings;
    use utf8;
    
    use Test::More tests => 1;
    
    is "🐪" =~ /\N{DROMEDARY CAMEL}/, 1,
        'Test: "🐪" =~ /\N{DROMEDARY CAMEL}/';
    

    Output:

    $ prove -v reproduce_wide_char_warn.t
    reproduce_wide_char_warn.t ..
    1..1
    ok 1 - Test: "🐪" =~ /\N{DROMEDARY CAMEL}/
    Wide character in print at /home/ken/perl5/perlbrew/perls/perl-5.39.3/lib/5.39.3/Test2/Formatter/TAP.pm line 156.
    ok
    All tests successful.
    Files=1, Tests=1,  0 wallclock secs ( 0.02 usr  0.03 sys +  0.08 cusr  0.08 csys =  0.20 CPU)
    Result: PASS
    

    Resolve warning

    Code:

    $ cat fix_wide_char_warn.t
    #!perl
    
    use strict;
    use warnings;
    use utf8;
    use open OUT => qw{:encoding(UTF-8) :std};
    
    use Test::More tests => 1;
    
    is "🐪" =~ /\N{DROMEDARY CAMEL}/, 1,
        'Test: "🐪" =~ /\N{DROMEDARY CAMEL}/';
    

    Output:

    $ prove -v fix_wide_char_warn.t
    fix_wide_char_warn.t ..
    1..1
    ok 1 - Test: "🐪" =~ /\N{DROMEDARY CAMEL}/
    ok
    All tests successful.
    Files=1, Tests=1,  0 wallclock secs ( 0.02 usr  0.03 sys +  0.08 cusr  0.09 csys =  0.22 CPU)
    Result: PASS
    

    I'm not familiar with the Thai script, so I don't think I can help you much there. I did notice that some characters are very similar (e.g. & ); so look closely at that sort of thing. I'm also aware that in some Unicode scripts the glyph changes depending on context (e.g. initial/start of word, medial/middle of word, final/end of word, isolate/by itself); so that's maybe something to investigate.

    — Ken

      The similarities among Thai characters are one of the reasons a Thai reader who can read without stumbling is rare. The biggest reason is the lack of spaces between words (and this is one of the reasons these character classes would be so helpful, as they could help with word-splitting or syllable-splitting Thai text). I've never heard of a Thai speed reader, and I do not believe it is possible. That said, there is definitely a difference between the otherwise similar characters in usage and in pronunciation; and experienced readers will be able to guess the word without seeing the minute details. Others, like myself, prefer to have a larger font size so as to see those minutiae more clearly.

      None of these issues, however, affect the script at present. I have gone over the codepoints with a virtual fine-toothed comb, rechecking all of them. I did make a couple of corrections in the process, one minor one being the removal of the unassigned codepoints from one of the code blocks. There remain some edge cases which the script does not address--perhaps in the future a Thai linguist of superior skill might suggest additional functionality.

      I feel more comfortable with the Thai side of this script, at present, than with the Perl side (which simply refuses to work). It is to contribute to the Thai programming community that I do this; and I much appreciate the Perl gurus who are able to help on that side of things.

      Blessings,

      ~Polyglot~

      Incidentally, I am no longer using use Test::More;. I discovered that it was the source of all of my errors, including all of the "wide character" log messages, and my code is working well now without it--zero errors being logged. Apparently, Test::More was not designed to be compatible with unicode characters, and is therefore not fit for purpose for my script.

      I had planned to use it for the module testing, as is recommended in the guides for module preparation. Now, I'm not sure what to do. How can one go about testing his own script without this module? More importantly, how does one ensure that the installation will not fail in the absence of such a testing environment?

      I'm nearly ready to wrap up with the creation of the module but for details of this nature. Packaging for CPAN is a bit cumbersome--at least for the first time around while learning the ropes.

      Blessings,

      ~Polyglot~

        I am no longer using use Test::More;. I discovered that it was the source of all of my errors

        Perl testing is based on the Test Anything Protocol, which is also used to test languages other than Perl ... so there is no requirement for your module to use Test::More.
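As a rough illustration of that point, TAP output can be produced with nothing but `print`. The sketch below is not a replacement for a real test module; the property name `InDemoKokai` is a hypothetical stand-in for whatever the module under test exports.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';    # the test names contain Thai text

# Stand-in property (hypothetical name); the real module would supply its own.
sub InDemoKokai { return "0E01\n" }

# Each check is [ description, boolean result ]. scalar() keeps a failed
# match from collapsing to an empty list inside the array constructor.
my @checks = (
    [ 'ก matches InDemoKokai',        scalar( 'ก' =~ /\p{InDemoKokai}/ ) ],
    [ 'ข does not match InDemoKokai', scalar( 'ข' !~ /\p{InDemoKokai}/ ) ],
);

# TAP: a plan line "1..N", then one "ok" or "not ok" line per test.
print '1..', scalar @checks, "\n";
my $n = 0;
for my $check (@checks) {
    my ( $name, $passed ) = @$check;
    printf "%s %d - %s\n", ( $passed ? 'ok' : 'not ok' ), ++$n, $name;
}
```

A harness such as `prove` should be able to consume this output just as it would the output of a Test::More script.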

        That said, it would be nice if you could provide us with a SSCCE that clearly illustrates the problems you were experiencing with Test::More.

        See also: hippo's excellent Basic Testing Tutorial

        👁️🍾👍🦟

        "Wide character in ..." is a warning. See "perldiag: Wide character in %s". Please stop calling it an error.

        You showed this warning when using Test::More:

        Wide character in print at /.../Test2/Formatter/TAP.pm line 125.

        I simulated that warning when using Test::More:

        Wide character in print at /.../Test2/Formatter/TAP.pm line 156.

        The only difference is the line number, which I'd guess, in the absence of other information, is due to you using a different version. Test::More and Test2::Formatter::TAP (along with many other modules) are part of the Test-Simple distribution. I'm using:

        $ perl -E 'use Test::More; say $Test::More::VERSION;'
        1.302195
        $ perl -E 'use Test2::Formatter::TAP; say $Test2::Formatter::TAP::VERSION;'
        1.302195

        What version are you using?

        My line 156 looks like this:

        print $io $ok;

        What does your line 125 look like?

        I provided you with a solution to your problem by using:

        use open OUT => qw{:encoding(UTF-8) :std};

        Did you try that? If so, what was the outcome? If not, why not?

        The issue here is in no way specific to Test::More. Consider this code which generates the warning:

        $ perl -e '
            print "\N{DROMEDARY CAMEL}\n";
        '
        Wide character in print at -e line 2.
        🐪
        

        And this code which does not:

        $ perl -e '
            use open OUT => qw{:encoding(UTF-8) :std};
            print "\N{DROMEDARY CAMEL}\n";
        '
        🐪
        

        — Ken

        I discovered that it was the source of all of my errors

        The errors ("Premature end of script headers") are logged because the warnings are being printed before your HTTP headers. You took my advice to wrap the headers in a BEGIN block, but skipped the part where I explained that the BEGIN block might need to go before certain modules were even used. Getting that right is difficult, which is why I also suggested use CGI::Carp qw(fatalsToBrowser); (because that might help with your debug process).
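The ordering described above could be sketched roughly as follows. This assumes a plain CGI environment in which the script is compiled afresh for every request; under a persistent interpreter a BEGIN block would typically run only once, so the sketch may not carry over unchanged. The HTML body is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile-time statements run in source order, so this block executes
# before anything loaded by the "use" lines that follow it.
BEGIN {
    $| = 1;    # flush STDOUT immediately so the header goes out first
    print "Content-Type: text/html; charset=utf-8\n\n";
}

# Loaded only after the header has been printed.
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};    # UTF-8 layers for STDOUT and STDERR

# Placeholder body; wide characters no longer trigger a warning here.
print "<p>ก</p>\n";
```

The header itself is plain ASCII, so printing it before the UTF-8 layer is in place does no harm.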

        The warnings ("wide character") are because your script didn't set up the appropriate binmode/open-mode for all the various outputs that are used -- and other monks have given you better advice than I could on that, including Test::More::UTF8, which I had never heard of, but will definitely keep in my arsenal going forward.

        I had planned to use it for the module testing

        I will admit that I haven't tested a CGI script, per se. But using Test::More inside the script that's generating the response to the browser seems weird to me. Normally tests are run from the command line (not on your live webserver), and the test script will call the various functions from the modules you wrote that your CGI script is calling. And, if you end up with a lot of logic/etc inside your CGI script that needs testing, you could even have a test that runs your CGI script (CGI can be run from the command line, without the involvement of the webserver -- even the old CGI.pm documentation explained how to do that). Or you could even have an HTTP client inside your test script, which would connect to the webserver to test the endpoints of your CGI (testing on your live server is probably not the best either, but you could have your test suite launch a private webserver instance on your test machine, without it being on your final webhost yet).

        Test::More and the TAP protocol send some of the output (like the test name and ok/not-ok) to STDOUT, but send other information (like the failure diagnostics) to STDERR. Trying to properly handle the TAP output inside the CGI environment to generate a valid webpage seems tricky at best, and I am convinced that this, rather than a problem in Test::More itself, is the cause of your headaches.
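A command-line test file of the kind described above might look something like this. It is only a sketch: `InDemoPreVowel` is a hypothetical stand-in defined inside the file so the example is self-contained, whereas a real test would `use` the module and exercise its exported properties. The `binmode` calls on the builder's handles follow the approach given in the Test::More documentation for wide-character output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Test::More tests => 2;

# Give Test::More's own output handles a UTF-8 layer, so that Thai text
# in test names and diagnostics does not trigger "Wide character" warnings.
my $builder = Test::More->builder;
for my $method (qw( output failure_output todo_output )) {
    binmode $builder->$method, ':encoding(UTF-8)';
}

# Stand-in for a property the real module would export (hypothetical name).
# One line holding a tab-separated range: U+0E40 through U+0E44.
sub InDemoPreVowel { return "0E40\t0E44\n" }

like   'ไ', qr/\p{InDemoPreVowel}/, 'ไ is a pre-vowel';
unlike 'ก', qr/\p{InDemoPreVowel}/, 'ก is not a pre-vowel';
```

Saved as, say, a `.t` file, it can be run with `prove -v` from the shell, with no web server or HTTP headers involved.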

        Test::More is used widely. It's highly improbable it can cause any errors. Regarding the wide characters, maybe all you needed was Test::More::UTF8?

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]