Re^3: Listing out the characters included in a character class [wide character warning]
by Polyglot (Chaplain) on Nov 01, 2023 at 11:40 UTC
The similarity among Thai characters is one of the reasons that Thai readers who can read without stumbling are rare. The biggest reason is the lack of spaces between words (and this is one of the reasons these character classes would be so helpful, as they could help with word-splitting or syllable-splitting Thai text). I've never heard of a Thai speed reader, and I do not believe it is possible. That said, there is definitely a difference between the otherwise similar characters in usage and in pronunciation, and experienced readers are able to guess the word without seeing the minute details. Others, like myself, prefer a larger font size so as to see those minutiae more clearly.
None of these issues, however, affects the script at present. I have gone over the codepoints with a virtual fine-toothed comb, rechecking all of them. I did make a couple of corrections in the process, one minor one being the removal of the unassigned codepoints from one of the code blocks. There remain some edge cases which the script does not address--perhaps in the future a Thai linguist of superior skill might suggest additional functionality.
I feel more comfortable with the Thai side of this script, at present, than with the Perl side (which simply refuses to work). It is to contribute to the Thai programming community that I do this; and I much appreciate the Perl gurus who are able to help on that side of things.
Re^3: Listing out the characters included in a character class [wide character warning]
by Polyglot (Chaplain) on Nov 03, 2023 at 03:23 UTC
Incidentally, I am no longer using Test::More. I discovered that it was the source of all of my errors, including all of the "wide character" log messages, and my code is working well now without it--zero errors being logged. Apparently, Test::More was not designed to be compatible with Unicode characters, and is therefore not fit for purpose for my script.
I had planned to use it for the module testing, as is recommended in the guides for module preparation. Now, I'm not sure what to do. How can one go about testing his own script without this module? More importantly, how does one ensure that the installation will not fail in the absence of such a testing environment?
I'm nearly ready to wrap up with the creation of the module but for details of this nature. Packaging for CPAN is a bit cumbersome--at least for the first time around while learning the ropes.
I discovered that it was the source of all of my errors
The errors ("Premature end of script headers") are logged because the warnings are being printed before your HTTP headers. You took my advice to wrap the headers in a BEGIN block, but you skipped the part where I explained that the BEGIN block might need to go before certain modules are even loaded with use. Getting that right is difficult, which is why I also suggested use CGI::Carp qw(fatalsToBrowser); (because that might help with your debug process).
The warnings ("wide character") are because your script didn't set up the appropriate binmode/open-mode for all the various outputs that are used -- and other monks have given you better advice than I could on that, including Test::More::UTF8 , which I had never heard of, but will definitely keep in my arsenal going forward.
I had planned to use it for the module testing
I will admit that I haven't tested a CGI script, per se. But using Test::More inside the script that's generating the response to the browser seems weird to me. Normally tests are run from the command line (not on your live webserver), and the test script will call the various functions from the modules you wrote that your CGI script is calling. And, if you end up with a lot of logic/etc inside your CGI script that needs testing, you could even have a test that runs your CGI script (CGI can be run from the command line, without the involvement of the webserver -- even the old CGI.pm documentation explained how to do that). Or you could even have an HTTP client inside your test script, which would connect to the webserver to test the endpoints of your CGI (testing on your live server is probably not the best either, but you could have your test suite launch a private webserver instance on your test machine, without it being on your final webhost yet).
Test::More follows the TAP protocol, which puts some of the output (like the test name and ok/not-ok) on STDOUT but puts other information (like the failure diagnostics) on STDERR. Trying to properly handle the TAP output inside the CGI environment to generate a valid webpage seems tricky at best, and I am convinced that this (not a problem in Test::More itself) is the cause of your headaches.
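To see that split in isolation, here is a minimal sketch that points Test::More's two streams at in-memory scalars instead of STDOUT and STDERR. The variable names are illustrative, and it assumes the default TAP formatter is in use; it only shows which stream each kind of message lands on, not how the script under discussion was set up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;

# Redirect the builder's streams into scalars (illustrative names).
my ( $tap_out, $tap_err ) = ( '', '' );
my $builder = Test::More->builder;
$builder->output( \$tap_out );            # normally a copy of STDOUT
$builder->failure_output( \$tap_err );    # normally a copy of STDERR

pass('result lines use the normal stream');
diag('diagnostics use the failure stream');
done_testing;

# Show what was captured on each stream.
print "--- normal stream ---\n$tap_out";
print "--- failure stream ---\n$tap_err";
```

In a CGI script both streams would have to be dealt with before a valid response could be produced, which is part of why running the tests from the command line is simpler.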
I am no longer using Test::More. I discovered that it was the source of all of my errors
Perl testing is based on the Test Anything Protocol, which is also used to test languages other than Perl
...
so there is no requirement for your module to use Test::More.
That said, it would be nice if you could provide us with an SSCCE that clearly illustrates the problems you were experiencing with Test::More.
See also: hippo's excellent Basic Testing Tutorial
Unfortunately, I do not think it is possible to provide a correct SSCCE here for this case. The forum converts all of the characters which are related to the problem to HTML-entities, and I am unaware of a method by which the actual files could be attached.
Suffice it for now that the issue is caused by UTF8 embedded in the code and tested by the Test::More tests. Without a way to paste in actual code containing UTF8 characters, unmangled, I see no point in going to the trouble of forming up an SSCCE for this case. I doubt it would be likely to exhibit the same behaviors, post-transfer/conversion, and would thus prove little.
Wide character in print at /.../Test2/Formatter/TAP.pm line 125.
I simulated that warning when using Test::More:
Wide character in print at /.../Test2/Formatter/TAP.pm line 156.
The only difference is the line number which, I'd guess, in the absence of other information, is due to your using a different version.
Test::More and Test2::Formatter::TAP (along with many other modules) are part of the Test-Simple distribution.
I'm using:
$ perl -E 'use Test::More; say $Test::More::VERSION;'
1.302195
$ perl -E 'use Test2::Formatter::TAP; say $Test2::Formatter::TAP::VERSION;'
1.302195
What version are you using?
My line 156 looks like this:
print $io $ok;
What does your line 125 look like?
I provided you with a solution to your problem by using:
use open OUT => qw{:encoding(UTF-8) :std};
Did you try that?
If so, what was the outcome?
If not, why not?
The issue here is in no way specific to Test::More.
Consider this code which generates the warning:
$ perl -e '
print "\N{DROMEDARY CAMEL}\n";
'
Wide character in print at -e line 2.
🐪
And this code which does not:
$ perl -e '
use open OUT => qw{:encoding(UTF-8) :std};
print "\N{DROMEDARY CAMEL}\n";
'
🐪
Note that ordering is important:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Test::More tests => 1;
use open OUT => ':encoding(UTF-8)', ':std';
is "kůň", 1, 'same';
gives the warning, while
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';
use Test::More tests => 1;
is "kůň", 1, 'same';
does not.
That's why I recommended Test::More::UTF8. You can place it wherever you like and there are no warnings.
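Another route that does not depend on load order is to apply the encoding layer directly to the handles the test builder writes to; the Test::More documentation describes an approach along these lines. What follows is a minimal sketch under that assumption. The helper layers_of is a made-up name used only for inspection, and only the two main streams are touched here (the builder also has a todo_output handle, left out of this sketch).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Test::More;

# Test::More prints through its own handles, so set the layer on those.
my $builder = Test::More->builder;
binmode $builder->output,         ':encoding(UTF-8)';
binmode $builder->failure_output, ':encoding(UTF-8)';

# Hypothetical helper: list the PerlIO layers on a handle.
sub layers_of { return join ' ', PerlIO::get_layers( $_[0] ) }

note 'output layers: ' . layers_of( $builder->output );
is 'kůň', 'kůň', 'kůň compares equal to itself';
done_testing;
```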
Update: <code> to <pre> to fix the non-English characters.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
binmode STDERR, ":utf8";
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
Honestly, there are so many ways in Perl of dealing with UTF-8 that the mind spins--and they are not all created equal. I had not seen the particular method you recommended, but again, it turned out no different from what I already had in place. If it does the same as the three lines, I may prefer it going forward. The three-line version does appear to have the advantage of being more specific, giving one the option of choosing which of the three handles to affect. And I've, at various times, used other methods as well, including the Encode module, etc.
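One way to compare the two approaches is to list the PerlIO layers on each standard handle. Going by the open pragma's documentation, with only OUT given, :std should affect STDOUT and STDERR but leave STDIN alone, whereas the three binmode lines cover all three. The sketch below is a way to check that on a given perl; has_encoding_layer is a hypothetical helper name, and the result could differ if the environment already sets default layers (for example via PERL_UNICODE).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use open OUT => ':encoding(UTF-8)', ':std';

# Hypothetical helper: true if the handle carries an encoding/utf8 layer.
sub has_encoding_layer {
    my ($fh) = @_;
    my @hits = grep { /^(?:encoding|utf8)/ } PerlIO::get_layers($fh);
    return scalar @hits;
}

for my $pair ( [ STDIN => \*STDIN ], [ STDOUT => \*STDOUT ], [ STDERR => \*STDERR ] ) {
    my ( $name, $fh ) = @$pair;
    printf "%-6s layers: %s (encoded: %s)\n",
        $name,
        join( ' ', PerlIO::get_layers($fh) ),
        has_encoding_layer($fh) ? 'yes' : 'no';
}
```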
What does your line 125 look like?
print $io $msg;
...and my line 156 looks identical to yours.
Regarding the last portion of your post, I know you are well-meaning so I will overlook how it comes across. My username is not without significance. From the very beginning of my Perl programming career, I have dealt with non-ASCII encodings (I was programming for Asian languages from the get-go). The "wide character" message is one I have seen thousands of times--and I well know its typical causes.
Thank you for your help! (This is genuine, not being sarcastic--I just felt it necessary to clarify owing to the prior paragraph which might color the perception of my tone.)
Test::More is used widely. It's highly improbable it can cause any errors. Regarding the wide characters, maybe all you needed was Test::More::UTF8?
Well, I finally got around to the troublesome matter of the testing script for the module, and found I was forced to use one of the tools for testing...so, Test::More::UTF8 it was.
...until it wasn't.
Something is different in the coding for the UTF8, it appears, and the script failed to run. Rather than more hours trying to troubleshoot what is unfamiliar to me, I've abandoned the /t folder for testing and gone the other route of having my own test.pl script in the main folder for the module. In that I use "Test::Simple" for a few simple tests, then run my own tests. It seems to work okay, but I wish it were better.
#!/usr/bin/perl
use strict;
use warnings;
use 5.008;
use utf8;
use FindBin qw($Bin);
use lib "$Bin/lib";
use blib './lib/';
use Regexp::CharClasses::Thai;
use Regexp::CharClasses::Thai qw(:all);
binmode STDOUT, ':utf8';
use Test::Simple 'no_plan';
my $failure = 0;
#########################
# TEST ITEMS
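# ok() must receive the match result itself; a non-empty quoted string is always true.
ok( 'ก' =~ /\p{IsThai}/,      'ko kai matches IsThai' );
ok( 'ก' =~ /\p{InThaiCons}/,  'ko kai matches InThaiCons' );
ok( 'ก' =~ /\p{InThaiMCons}/, 'ko kai matches InThaiMCons' );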
is( q"'ก' =~ /\p{IsKokai}/",1,' Match for "ก" =~ /\p{IsKokai}/');
is( q"'ก' =~ /\p{InThai}/",1,' Match for "ก" =~ /\p{InThai}/');
is( q"'ก' =~ /\p{InThaiAlpha}/",1,' Match for "ก" =~ /\p{InThaiAlpha}/');
is( q"'ก' =~ /\p{InThaiCons}/",1,' Match for "ก" =~ /\p{InThaiCons}/');
isnt( q"'ก' =~ /\p{InThaiHCons}/",0,' No match for "ก" =~ /\p{InThaiHCons}/');
is( q"'ก' =~ /\p{InThaiMCons}/",1,' Match for "ก" =~ /\p{InThaiMCons}/');
isnt( q"'ก' =~ /\p{InThaiLCons}/",0,' No match for "ก" =~ /\p{InThaiLCons}/');
isnt( q"'ก' =~ /\p{InThaiDigit}/",0,' No match for "ก" =~ /\p{InThaiDigit}/');
isnt( q"'ก' =~ /\p{InThaiTone}/",0,' No match for "ก" =~ /\p{InThaiTone}/');
isnt( q"'ก' =~ /\p{InThaiVowel}/",0,' No match for "ก" =~ /\p{InThaiVowel}/');
isnt( q"'ก' =~ /\p{InThaiCompVowel}/",0,' No match for "ก" =~ /\p{InThaiCompVowel}/');
isnt( q"'ก' =~ /\p{InThaiPreVowel}/",0,' No match for "ก" =~ /\p{InThaiPreVowel}/');
isnt( q"'ก' =~ /\p{InThaiPostVowel}/",0,' No match for "ก" =~ /\p{InThaiPostVowel}/');
isnt( q"'ก' =~ /\p{InThaiPunct}/",0,' No match for "ก" =~ /\p{InThaiPunct}/');
is( q"'ก' =~ /\p{InThaiFinCons}/",1,' Match for "ก" =~ /\p{InThaiFinCons}/');
isnt( q"'ก' =~ /\p{InThaiMute}/",0,' No match for "ก" =~ /\p{InThaiMute}/');
is( q"'ไ' =~ /\p{InThai}/",1,' Match for "ไ" =~ /\p{InThai}/');
is( q"'ไ' =~ /\p{InThaiAlpha}/",1,' Match for "ไ" =~ /\p{InThaiAlpha}/');
is( q"'ไ' =~ /\p{InThaiWord}/",1,' Match for "ไ" =~ /\p{InThaiWord}/');
isnt( q"'ไ' =~ /\p{InThaiCons}/",0,' No match for "ไ" =~ /\p{InThaiCons}/');
isnt( q"'ไ' =~ /\p{InThaiHCons}/",0,' No match for "ไ" =~ /\p{InThaiHCons}/');
isnt( q"'ไ' =~ /\p{InThaiMCons}/",0,' No match for "ไ" =~ /\p{InThaiMCons}/');
isnt( q"'ไ' =~ /\p{InThaiLCons}/",0,' No match for "ไ" =~ /\p{InThaiLCons}/');
isnt( q"'ไ' =~ /\p{InThaiDigit}/",0,' No match for "ไ" =~ /\p{InThaiDigit}/');
isnt( q"'ไ' =~ /\p{InThaiTone}/",0,' No match for "ไ" =~ /\p{InThaiTone}/');
is( q"'ไ' =~ /\p{InThaiVowel}/",1,' Match for "ไ" =~ /\p{InThaiVowel}/');
isnt( q"'ไ' =~ /\p{InThaiCompVowel}/",0,' No match for "ไ" =~ /\p{InThaiCompVowel}/');
is( q"'ไ' =~ /\p{InThaiPreVowel}/",1,' Match for "ไ" =~ /\p{InThaiPreVowel}/');
isnt( q"'ไ' =~ /\p{InThaiPostVowel}/",0,' No match for "ไ" =~ /\p{InThaiPostVowel}/');
isnt( q"'ไ' =~ /\p{InThaiPunct}/",0,' No match for "ไ" =~ /\p{InThaiPunct}/');
is( q"'ไ' =~ /\p{IsSaraaimaimalai}/",1,' Match for "ไ" =~ /\p{IsSaraaimaimalai}/');
my $pv = 'ข่าวนี้ได้แพร่สะพัดออกไปอย่างรวดเร็ว';
my $prevowel_syllables = $pv =~ s/
(
(?:\p{InThaiPreVowel})
(?:
(?:\p{InThaiDualC1}\p{InThaiDualC2})
|
(?:\p{InThaiCons}){1}
)
(?:\p{InThaiTone}\p{InThaiCompVowel}\p{InThaiPostVowel}){0,3}
(?:
(?:\p{InThaiFinCons}\p{IsYoyak}\p{IsWowaen}){0,5}
(?!\p{InThaiPostVowel})
)*
(?:\p{InThaiMute})?
)
/($1)/gx;
print "Syllables with pre-vowels in 'ข่าวนี้ได้แพร่สะพัดออกไปอย่างรวดเร็ว' --> $pv: $prevowel_syllables\n"; # 4
if ($prevowel_syllables == 4) { print "Syllables test succeeded.\n\n" } else { print "Syllables test FAILED.\n\n"; $failure++};
if ($failure) {
print "No success: $failure tests failed.\n";
exit $failure;
} else {
print "Success. All tests passed.\n";
exit 0;
};
exit;
sub is {
my $test = shift @_;
my $val = shift @_;
my $say = shift @_;
print "TEST: $say\t";
if ((eval($test)) == $val) {
print "Passed in the affirmative.\n"
} else {
print "FAILED! INCORRECTLY NEGATIVE.\n";
$failure++
};
};
sub isnt {
my $test = shift @_;
my $val = shift @_;
my $say = shift @_;
print "TEST: $say\t";
if (eval($test) != $val) {
print "FAILED! INCORRECTLY AFFIRMATIVE.\n";
$failure++
} else {
print "Passed in the negative.\n"
};
};
EDIT: Cleaned it up a bit and changed to "pre" tags, hoping for better readability.
Maybe someday I'll figure out the Test::More::UTF8. Until then, this approach will hopefully at least get the module installed.
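For comparison, a t/-style test can exercise user-defined properties with real matches rather than evaluating quoted strings. The sketch below is an illustration under stated assumptions, not the module's code: InDemoThaiDigit is a hypothetical property defined inline (covering U+0E50 to U+0E59) that stands in for the classes exported by Regexp::CharClasses::Thai, and the output layer is set before Test::More is loaded, following the ordering noted earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open OUT => ':encoding(UTF-8)', ':std';    # before Test::More
use Test::More;

# Hypothetical user-defined property: Thai digits, U+0E50..U+0E59.
# Perl requires such names to start with "In" or "Is".
sub InDemoThaiDigit {
    return "0E50\t0E59\n";
}

like   '๕', qr/\p{InDemoThaiDigit}/, '๕ is a Thai digit';
unlike 'ก', qr/\p{InDemoThaiDigit}/, 'ก is not a Thai digit';
like   'ก', qr/\p{Thai}/,            'ก is in the Thai script';

# Count matches in a string, much as the syllable test counts with s///g.
my $text  = 'ปี ๒๕๖๖';
my $count = () = $text =~ /\p{InDemoThaiDigit}/g;
is $count, 4, 'four Thai digits found';

done_testing;
```

Run from the command line with prove or plain perl, a file like this reports each check through TAP, so no hand-written is/isnt replacements are needed.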