Re^2: Listing out the characters included in a character class [wide character warning]

Re^3: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 01, 2023 at 11:40 UTC
The similarities among Thai characters are one of the reasons a Thai reader who can read without stumbling is rare. The biggest reason is the lack of spaces between words (which is also why these character classes would be so helpful, as they could help with word-splitting or syllable-splitting Thai text). I've never heard of a Thai speed reader, and I do not believe it is possible. That said, there is definitely a difference between the otherwise similar characters in usage and in pronunciation, and experienced readers will be able to guess the word without seeing the minute details. Others, like myself, prefer a larger font size so as to see those minutiae more clearly. None of these issues, however, affects the script at present. I have gone over the codepoints with a virtual fine-toothed comb, rechecking all of them. I did make a couple of corrections in the process, one minor one being the removal of the unassigned codepoints from one of the code blocks. There remain some edge cases which the script does not address; perhaps in the future a Thai linguist of superior skill might suggest additional functionality. I feel more comfortable with the Thai side of this script, at present, than with the Perl side (which simply refuses to work). It is to contribute to the Thai programming community that I do this, and I much appreciate the Perl gurus who are able to help on that side of things. Blessings, ~Polyglot~
Re^3: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 03, 2023 at 03:23 UTC
Incidentally, I am no longer using `use Test::More;`. I discovered that it was the source of all of my errors, including all of the "wide character" log messages, and my code is working well now without it, with zero errors being logged. Apparently, `Test::More` was not designed to be compatible with Unicode characters, and is therefore not fit for purpose for my script. I had planned to use it for the module testing, as is recommended in the guides for module preparation. Now, I'm not sure what to do. How can one go about testing one's own script without this module? More importantly, how does one ensure that the installation will not fail in the absence of such a testing environment? I'm nearly ready to wrap up the creation of the module but for details of this nature. Packaging for CPAN is a bit cumbersome, at least the first time around while learning the ropes. Blessings, ~Polyglot~
Re^4: Listing out the characters included in a character class [wide character warning] by pryrt (Abbot) on Nov 03, 2023 at 13:36 UTC
"I discovered that it was the source of all of my errors"

The errors ("Premature end of script headers") are logged because the warnings are being printed before your HTTP headers. You took my advice to wrap the headers in a BEGIN block, but you skipped the part where I explained that the BEGIN block might need to go before certain modules were even used. Getting that right is difficult, which is why I also suggested `use CGI::Carp qw(fatalsToBrowser);` (because that might help with your debug process).

The warnings ("wide character") appear because your script didn't set up the appropriate `binmode`/`open` mode for all the various outputs that are used. Other monks have given you better advice than I could on that, including Test::More::UTF8, which I had never heard of but will definitely keep in my arsenal going forward.

"I had planned to use it for the module testing"

I will admit that I haven't tested a CGI script, per se. But using Test::More inside the script that's generating the response to the browser seems weird to me. Normally tests are run from the command line (not on your live webserver), and the test script calls the various functions from the modules you wrote that your CGI script is calling. If you end up with a lot of logic inside your CGI script that needs testing, you could even have a test that runs your CGI script (CGI can be run from the command line, without the involvement of the webserver; even the old CGI.pm documentation explained how to do that). Or you could have an HTTP client inside your test script, which would connect to the webserver to test the endpoints of your CGI. Testing on your live server is probably not the best either, but your test suite could launch a private webserver instance on your test machine, without it being on your final webhost yet.

Test::More and the TAP protocol send some of the output (like the test name and ok/not-ok) to STDOUT, and other information (like the failure diagnostics) to STDERR. Trying to properly handle the TAP output inside the CGI environment to generate a valid webpage seems tricky at best, and I am convinced that this (not a problem in Test::More itself) is the cause of your headaches.
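The split between the two streams can be made visible by pointing Test::More's builder at in-memory filehandles. This is only a minimal sketch and is not taken from the thread: the buffer and handle names are illustrative, and it assumes a Test::Builder whose `output` and `failure_output` methods accept an open filehandle, as the Test::Builder documentation describes.

```perl
use strict;
use warnings;
use Test::More;

# Capture the two streams Test::More writes to in separate buffers.
my ($tap_out, $tap_err) = ('', '');
open my $out_fh, '>', \$tap_out or die "Cannot open buffer: $!";
open my $err_fh, '>', \$tap_err or die "Cannot open buffer: $!";

my $builder = Test::More->builder;
$builder->output($out_fh);            # test results ("ok 1 - ...")
$builder->failure_output($err_fh);    # diagnostics from diag() and failures

ok 1, 'a passing test';
diag 'a diagnostic message';
done_testing;

# Show where each piece ended up, using the real STDOUT.
print "Captured from the result stream:\n$tap_out";
print "Captured from the diagnostic stream:\n$tap_err";
```

In a CGI response only STDOUT reaches the browser, while STDERR usually lands in the server's error log, which is one reason TAP output is awkward to embed in a web page.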
Re^5: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 04, 2023 at 12:30 UTC
I had not intended the CGI script to be used for the official package testing--it was for my own testing, for matters of convenience at my end (think editing in TextWrangler instead of via 'nano' on the server). However, UTF8 characters are still UTF8 characters, whether printed from a CGI script or from one run at the command terminal. This should not matter at all. Neither had I ever heard of Test::More::UTF8. It's hard to use what is not known. Blessings, ~Polyglot~
Re^4: Listing out the characters included in a character class [wide character warning] by eyepopslikeamosquito (Archbishop) on Nov 03, 2023 at 06:45 UTC
"I am no longer using `use Test::More;`. I discovered that it was the source of all of my errors"

Perl testing is based on the Test Anything Protocol, which is also used to test languages other than Perl, so there is no requirement for your module to use Test::More. That said, it would be nice if you could provide us with an SSCCE that clearly illustrates the problems you were experiencing with Test::More.

See also: hippo's excellent Basic Testing Tutorial

👁️🍾👍🦟
Re^5: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 03, 2023 at 07:24 UTC
Unfortunately, I do not think it is possible to provide a correct SSCCE here for this case. The forum converts all of the characters which are related to the problem to HTML-entities, and I am unaware of a method by which the actual files could be attached. Suffice it for now that the issue is caused by UTF8 embedded in the code and tested by the Test::More tests. Without a way to paste in actual code containing UTF8 characters, unmangled, I see no point in going to the trouble of forming up an SSCCE for this case. I doubt it would be likely to exhibit the same behaviors, post-transfer/conversion, and would thus prove little. Blessings, ~Polyglot~
Re^6: Listing out the characters included in a character class [wide character warning] by pryrt (Abbot) on Nov 03, 2023 at 13:46 UTC
Re^7: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 04, 2023 at 13:18 UTC
Re^4: Listing out the characters included in a character class [wide character warning] by kcott (Archbishop) on Nov 03, 2023 at 09:58 UTC
"`Wide character in ...`" is a warning. See "perldiag: Wide character in %s". Please stop calling it an error.

You showed this warning when using `Test::More`:

    Wide character in print at /.../Test2/Formatter/TAP.pm line 125.

I simulated that warning when using `Test::More`:

    Wide character in print at /.../Test2/Formatter/TAP.pm line 156.

The only difference is the line number which, in the absence of other information, I'd guess is due to you using a different version.

`Test::More` and `Test2::Formatter::TAP` (along with many other modules) are part of the Test-Simple distribution. I'm using:

    $ perl -E 'use Test::More; say $Test::More::VERSION;'
    1.302195
    $ perl -E 'use Test2::Formatter::TAP; say $Test2::Formatter::TAP::VERSION;'
    1.302195

What version are you using?

My line 156 looks like this:

    print $io $ok;

What does your line 125 look like?

I provided you with a solution to your problem by using:

    use open OUT => qw{:encoding(UTF-8) :std};

Did you try that? If so, what was the outcome? If not, why not?

The issue here is in no way specific to `Test::More`. Consider this code which generates the warning:

    $ perl -e '
        print "\N{DROMEDARY CAMEL}\n";
    '
    Wide character in print at -e line 2.
    🐪

And this code which does not:

    $ perl -e '
        use open OUT => qw{:encoding(UTF-8) :std};
        print "\N{DROMEDARY CAMEL}\n";
    '
    🐪

-- Ken
Re^5: Listing out the characters included in a character class [wide character warning] by choroba (Cardinal) on Nov 03, 2023 at 10:10 UTC
Note that ordering is important:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;

    use Test::More tests => 1;
    use open OUT => ':encoding(UTF-8)', ':std';

    is "kůň", 1, 'same';

gives the warning, while

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;

    use open OUT => ':encoding(UTF-8)', ':std';
    use Test::More tests => 1;

    is "kůň", 1, 'same';

does not. That's why I recommended Test::More::UTF8. You can place it wherever you like and there are no warnings.

Update: `<code>` to `<pre>` to fix the non-English characters.

`map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
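A further option, which does not depend on where `use open` appears, is to set the layer directly on the handles Test::More prints to. The following is a minimal sketch along the lines of the approach described in the Test::More documentation; the `@warnings` collector and the test name are illustrative additions, not something posted in this thread.

```perl
use strict;
use warnings;
use utf8;                        # this file contains non-ASCII literals
use Test::More tests => 1;

# Collect warnings so a "Wide character" message would be noticed.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# Put a UTF-8 encoding layer on each handle the builder writes to.
my $builder = Test::More->builder;
for my $method (qw(output failure_output todo_output)) {
    binmode $builder->$method(), ':encoding(UTF-8)';
}

is 'kůň', 'kůň', 'kůň compares equal to itself';
```

Because the layer is applied to the builder's own handles after Test::More has loaded, the load order of the other pragmas should no longer matter for the TAP output.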
Re^6: Listing out the characters included in a character class [wide character warning] by kcott (Archbishop) on Nov 03, 2023 at 10:22 UTC
Re^6: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 04, 2023 at 12:03 UTC
Re^5: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 04, 2023 at 11:55 UTC
"Did you try that? If so, what was the outcome? If not, why not?"

Yes, I copied that into my code, replacing what I was doing (TMTOWTDI), and it made no difference. I had been using these lines already:

    binmode STDERR, ":utf8";
    binmode STDIN, ":utf8";
    binmode STDOUT, ":utf8";

Honestly, there are so many ways in Perl of dealing with UTF8 that the mind spins, and they are not all created equal. I had not seen the particular method you recommended, but again, it turned out no different than what I had had in place already. If it does the same as the three lines, I may prefer it going forward. The three-line version does appear to have the advantage of being more specific, giving one the option of selecting among the three handles. And I've, at various times, used other methods as well, including the Encode module, etc.

"What does your line 125 look like?"

    print $io $msg;

...and my line 156 looks identical to yours.

Regarding the last portion of your post, I know you are well-meaning so I will overlook how it comes across. My username is not without significance. From the very beginning of my Perl programming career, I have dealt with non-ASCII encodings (I was programming for Asian languages from the get-go). The "wide character" message is one I have seen thousands of times, and I well know its typical causes.

Thank you for your help! (This is genuine, not being sarcastic. I just felt it necessary to clarify owing to the prior paragraph which might color the perception of my tone.)

Blessings, ~Polyglot~
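Whichever spelling is used, the warning depends on whether the particular handle being printed to carries an encoding layer at the moment of the print. A minimal sketch of that behaviour follows, using in-memory filehandles so that nothing depends on the terminal or web server; `print_with_mode` is a hypothetical helper written for this illustration only.

```perl
use strict;
use warnings;

# Print $text to an in-memory filehandle opened with the given mode
# (for example '>' or '>:encoding(UTF-8)').
# Returns the bytes written, followed by any warnings raised.
sub print_with_mode {
    my ($mode, $text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return ($buffer, @warnings);
}

my $camel = "\N{U+1F42A}";    # DROMEDARY CAMEL, a character above 0xFF

my ($raw_bytes, @raw_warnings) = print_with_mode('>', $camel);
my ($enc_bytes, @enc_warnings) = print_with_mode('>:encoding(UTF-8)', $camel);

printf "no layer:    %d warning(s)\n", scalar @raw_warnings;
printf "UTF-8 layer: %d warning(s)\n", scalar @enc_warnings;
```

The same reasoning applies to handles that a module duplicates for its own use: a layer set on STDOUT afterwards does not reach a copy that was made earlier.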
Re^4: Listing out the characters included in a character class [wide character warning] by choroba (Cardinal) on Nov 03, 2023 at 09:59 UTC
Test::More is used widely. It's highly improbable it can cause any errors. Regarding the wide characters, maybe all you needed was Test::More::UTF8?

`map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^5: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 04, 2023 at 11:59 UTC
If Test::More were already capable of handling UTF8 characters, why the need for Test::More::UTF8? And if, for a UTF8 application, the second option is to be used, why is this not made more prominent? I saw nothing about it in the documentation for Test::More, and it's hard to even know of its existence. In any case, it's nice to know that there was something better. At this point, I may just roll my own tests anyhow...we'll see. (I don't need anything overly complex to begin with.) Blessings, ~Polyglot~
Re^5: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 28, 2023 at 07:01 UTC
Well, I finally got around to the troublesome matter of the testing script for the module, and found I was forced to use one of the tools for testing...so, Test::More::UTF8 it was. ...until it wasn't. Something is different in the coding for the UTF8, it appears, and the script failed to run. Rather than spend more hours trying to troubleshoot what is unfamiliar to me, I've abandoned the /t folder for testing and gone the other route of having my own test.pl script in the main folder for the module. In that I use "Test::Simple" for a few simple tests, then run my own tests. It seems to work okay, but I wish it were better.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use 5.008;
    use utf8;

    use FindBin qw($Bin);
    use lib "$Bin/lib";
    use blib './lib/';

    use Regexp::CharClasses::Thai;
    use Regexp::CharClasses::Thai qw(:all);

    binmode STDOUT, ':utf8';

    use Test::Simple 'no_plan';

    my $failure = 0;

    #########################

    # TEST ITEMS

    ok( q"'ก' =~ /\p{IsThai}/" );
    ok( q"'ก' =~ /\p{InThaiCons}/" );
    ok( q"'ก' =~ /\p{InThaiMCons}/" );

    is( q"'ก' =~ /\p{IsKokai}/",1,' Match for "ก" =~ /\p{IsKokai}/');
    is( q"'ก' =~ /\p{InThai}/",1,' Match for "ก" =~ /\p{InThai}/');
    is( q"'ก' =~ /\p{InThaiAlpha}/",1,' Match for "ก" =~ /\p{InThaiAlpha}/');
    is( q"'ก' =~ /\p{InThaiCons}/",1,' Match for "ก" =~ /\p{InThaiCons}/');
    isnt( q"'ก' =~ /\p{InThaiHCons}/",0,' No match for "ก" =~ /\p{InThaiHCons}/');
    is( q"'ก' =~ /\p{InThaiMCons}/",1,' Match for "ก" =~ /\p{InThaiMCons}/');
    isnt( q"'ก' =~ /\p{InThaiLCons}/",0,' No match for "ก" =~ /\p{InThaiLCons}/');
    isnt( q"'ก' =~ /\p{InThaiDigit}/",0,' No match for "ก" =~ /\p{InThaiDigit}/');
    isnt( q"'ก' =~ /\p{InThaiTone}/",0,' No match for "ก" =~ /\p{InThaiTone}/');
    isnt( q"'ก' =~ /\p{InThaiVowel}/",0,' No match for "ก" =~ /\p{InThaiVowel}/');
    isnt( q"'ก' =~ /\p{InThaiCompVowel}/",0,' No match for "ก" =~ /\p{InThaiCompVowel}/');
    isnt( q"'ก' =~ /\p{InThaiPreVowel}/",0,' No match for "ก" =~ /\p{InThaiPreVowel}/');
    isnt( q"'ก' =~ /\p{InThaiPostVowel}/",0,' No match for "ก" =~ /\p{InThaiPostVowel}/');
    isnt( q"'ก' =~ /\p{InThaiPunct}/",0,' No match for "ก" =~ /\p{InThaiPunct}/');
    is( q"'ก' =~ /\p{InThaiFinCons}/",1,' Match for "ก" =~ /\p{InThaiFinCons}/');
    isnt( q"'ก' =~ /\p{InThaiMute}/",0,' No match for "ก" =~ /\p{InThaiMute}/');

    is( q"'ไ' =~ /\p{InThai}/",1,' Match for "ไ" =~ /\p{InThai}/');
    is( q"'ไ' =~ /\p{InThaiAlpha}/",1,' Match for "ไ" =~ /\p{InThaiAlpha}/');
    is( q"'ไ' =~ /\p{InThaiWord}/",1,' Match for "ไ" =~ /\p{InThaiWord}/');
    isnt( q"'ไ' =~ /\p{InThaiCons}/",0,' No match for "ไ" =~ /\p{InThaiCons}/');
    isnt( q"'ไ' =~ /\p{InThaiHCons}/",0,' No match for "ไ" =~ /\p{InThaiHCons}/');
    isnt( q"'ไ' =~ /\p{InThaiMCons}/",0,' No match for "ไ" =~ /\p{InThaiMCons}/');
    isnt( q"'ไ' =~ /\p{InThaiLCons}/",0,' No match for "ไ" =~ /\p{InThaiLCons}/');
    isnt( q"'ไ' =~ /\p{InThaiDigit}/",0,' No match for "ไ" =~ /\p{InThaiDigit}/');
    isnt( q"'ไ' =~ /\p{InThaiTone}/",0,' No match for "ไ" =~ /\p{InThaiTone}/');
    is( q"'ไ' =~ /\p{InThaiVowel}/",1,' Match for "ไ" =~ /\p{InThaiVowel}/');
    isnt( q"'ไ' =~ /\p{InThaiCompVowel}/",0,' No match for "ไ" =~ /\p{InThaiCompVowel}/');
    is( q"'ไ' =~ /\p{InThaiPreVowel}/",1,' Match for "ไ" =~ /\p{InThaiPreVowel}/');
    isnt( q"'ไ' =~ /\p{InThaiPostVowel}/",0,' No match for "ไ" =~ /\p{InThaiPostVowel}/');
    isnt( q"'ไ' =~ /\p{InThaiPunct}/",0,' No match for "ไ" =~ /\p{InThaiPunct}/');
    is( q"'ไ' =~ /\p{IsSaraaimaimalai}/",1,' Match for "ไ" =~ /\p{IsSaraaimaimalai}/');

    my $pv = 'ข่าวนี้ได้แพร่สะพัดออกไปอย่างรวดเร็ว';
    my $prevowel_syllables = $pv =~ s/
        (
            (?:\p{InThaiPreVowel})
            (?:
                (?:\p{InThaiDualC1}\p{InThaiDualC2})
                |
                (?:\p{InThaiCons}){1}
            )
            (?:\p{InThaiTone}\p{InThaiCompVowel}\p{InThaiPostVowel}){0,3}
            (?:
                (?:\p{InThaiFinCons}\p{IsYoyak}\p{IsWowaen}){0,5}
                (?!\p{InThaiPostVowel})
            )*
            (?:\p{InThaiMute})?
        )
    /($1)/gx;

    print "Syllables with pre-vowels in 'ข่าวนี้ได้แพร่สะพัดออกไปอย่างรวดเร็ว' --> $pv: $prevowel_syllables\n"; # 4
    if ($prevowel_syllables == 4) {
        print "Syllables test succeeded.\n\n"
    } else {
        print "Syllables test FAILED.\n\n";
        $failure++
    };

    if ($failure) {
        print "No success: $failure tests failed.\n";
        exit $failure;
    } else {
        print "Success. All tests passed.\n";
        exit 0;
    };

    exit;

    sub is {
        my $test = shift @_;
        my $val  = shift @_;
        my $say  = shift @_;
        print "TEST: $say\t";
        if ((eval($test)) == $val) {
            print "Passed in the affirmative.\n"
        } else {
            print "FAILED! INCORRECTLY NEGATIVE.\n";
            $failure++
        };
    };

    sub isnt {
        my $test = shift @_;
        my $val  = shift @_;
        my $say  = shift @_;
        print "TEST: $say\t";
        if (eval($test) != $val) {
            print "FAILED! INCORRECTLY AFFIRMATIVE.\n";
            $failure++
        } else {
            print "Passed in the negative.\n"
        };
    };

EDIT: Cleaned it up a bit and changed to "pre" tags, hoping for better readability.

Maybe someday I'll figure out the Test::More::UTF8. Until then, this approach will hopefully at least get the module installed.

Blessings, ~Polyglot~
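For comparison, the same kind of match and no-match check can be written with Test::More's `like` and `unlike`, which apply the regex to the string directly instead of handing quoted code to `eval`. This is only a sketch under stated assumptions: `InDemoThaiCons` is a stand-in user-defined property declared inside the example (covering U+0E01 to U+0E2E), not one of the properties exported by Regexp::CharClasses::Thai, and the `use open` line is placed before Test::More following the ordering note earlier in this thread.

```perl
use strict;
use warnings;
use utf8;                                     # non-ASCII literals below
use open OUT => qw{:encoding(UTF-8) :std};    # before Test::More, per the ordering note
use Test::More;

# Hypothetical user-defined property, for illustration only:
# the Thai consonants U+0E01 (ก) through U+0E2E (ฮ).
sub InDemoThaiCons { return "0E01\t0E2E\n" }

# like() passes when the string matches, unlike() when it does not.
like   'ก', qr/\A\p{InDemoThaiCons}\z/, 'ko kai is in the consonant class';
unlike 'ไ', qr/\p{InDemoThaiCons}/,     'sara ai maimalai is not in the consonant class';

done_testing;
```

A file like this would normally live under `t/` and be run from the command line, for example with `prove -l t/`, so that the module's own `lib/` directory is on the include path.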
Re^6: Listing out the characters included in a character class [wide character warning] by choroba (Cardinal) on Nov 28, 2023 at 08:04 UTC
Re^7: Listing out the characters included in a character class [wide character warning] by Polyglot (Chaplain) on Nov 28, 2023 at 08:25 UTC