Re: FCGI, tied handles and wide characters

Re^2: FCGI, tied handles and wide characters by ikegami (Patriarch) on Sep 09, 2024 at 14:11 UTC
Correct. `unicode_strings` affects things that care whether characters in the U+80 to U+FF range are letters or not, or whitespace or not. This includes `uc`, `split` and the regex engine. It doesn't affect I/O.

`$ t() { perl -e' use open ":std", ":encoding(UTF-8)"; use if $ARGV[0], feature => "unicode_strings"; CORE::say uc "\xE9"; ' -- "$@" } $ t 0 é $ t 1 É`

`$ t() { perl -e' use open ":std", ":encoding(UTF-8)"; use if $ARGV[0], feature => "unicode_strings"; CORE::say join "|", split " ", "a\x20b\xA0c"; ' -- "$@" } $ t 0 a|b c $ t 1 a|b|c`
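The regex-engine case mentioned above can be checked in the same spirit. The following is a minimal sketch, not part of the original reply; the variable names are illustrative, and it assumes a perl recent enough to support the `/u` modifier (5.14 or later).

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # U+00E9, held as a single byte (never upgraded)

# Without the feature, a non-upgraded string is matched with native rules,
# under which characters in the 0x80 to 0xFF range are not word characters.
my $native = ( $e_acute =~ /\w/ ) ? 1 : 0;

my $unicode;
{
    use feature 'unicode_strings';

    # With the feature in lexical scope, the pattern uses Unicode rules.
    $unicode = ( $e_acute =~ /\w/ ) ? 1 : 0;
}

# The /u modifier requests Unicode rules for a single pattern.
my $u_flag = ( $e_acute =~ /\w/u ) ? 1 : 0;

print "native=$native unicode=$unicode u_flag=$u_flag\n";
```

Run as a standalone script, this should report `native=0 unicode=1 u_flag=1`. None of it involves a filehandle, which fits the point that the feature does not affect I/O.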
Re^2: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 09, 2024 at 14:43 UTC
That's interesting, but what I meant was exporting the pragma after the compilation stage, at the point when a specific function is called, which is impossible to my knowledge. However, I could probably take an option in my constructor like `Template->new(UNICODE=>1)` and modify ikegami's my_print to say `if ($self->{'UNICODE'}) { utf8::encode( $s ); }`, and maybe even throw in a binmode on STDOUT in the constructor while I'm at it, although that wouldn't help with the FCGI situation.

It turns out adding the `unicode_strings` feature to my Template module didn't work after all, so I guess I will need to explicitly encode at some point. Sadly my template module is about a 1000 lines past being an SSCCE, but the gist of it is that it slurps a file into an array and constructs a hash of arrays, which are interpolated and then printed based on conditionals being met. I am using `:encoding(UTF-8)` when I slurp the file, so between that and `unicode_strings` it's a puzzle why that's not enough to have the text encoded and flagged as utf8.
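The constructor-option idea could look roughly like the sketch below. It is an assumption-laden illustration: `My::Template` and `output_bytes` are hypothetical names, not the poster's module and not ikegami's `my_print`. Only the `UNICODE` option and the `utf8::encode` call come from the post.

```perl
use strict;
use warnings;

package My::Template;    # hypothetical stand-in for the poster's Template module

sub new {
    my ( $class, %opt ) = @_;
    return bless { UNICODE => $opt{UNICODE} ? 1 : 0 }, $class;
}

# Return the string to hand to print(); the caller's text is left untouched.
sub output_bytes {
    my ( $self, $text ) = @_;
    my $s = $text;                           # work on a copy
    utf8::encode($s) if $self->{UNICODE};    # characters -> UTF-8 bytes, in place
    return $s;
}

package main;

my $tmpl  = My::Template->new( UNICODE => 1 );
my $bytes = $tmpl->output_bytes("caf\x{E9} \x{2747}");

# Six characters go in; nine UTF-8 bytes come out.
print length($bytes), "\n";
```

Encoding a copy just before output keeps the rest of the module working on characters, and it does not depend on whether STDOUT is a plain handle or a tied one.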
Re^3: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 11, 2024 at 08:26 UTC
"Sadly my template module is about a 1000 lines past being an SSCCE"

The point of the SSCCE is that you strip out the 980 lines which are irrelevant to the problem at hand and just show us the self-contained, minimal script of 20 lines or less which reproduces the warning.

"I am using `:encoding(UTF-8)` when I slurp the file so ... it's a puzzle why that's not enough to have the text encoded and flagged as utf8."

If you use that when slurping your input it will be decoded (not encoded). This is so that your code can then work on it as character data. Once you have finished munging it and are ready to output it, that is the point at which it needs to be encoded (and from the warning in the OP, this is the step which is missing).

🦛
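The decode-on-input, encode-on-output cycle described above can be sketched with in-memory filehandles. This is an illustration rather than code from the thread: the sample bytes and variable names are assumptions, and `reverse` merely stands in for whatever munging the template module does.

```perl
use strict;
use warnings;

# UTF-8 bytes for "é ❇", standing in for the contents of a template file.
my $file_bytes = "\xC3\xA9 \xE2\x9D\x87";

# Input: the :encoding layer DECODES bytes into characters.
open( my $in, '<:encoding(UTF-8)', \$file_bytes ) or die "open: $!";
my $text = do { local $/; <$in> };
close $in;

# Munge character data: reversing works per character, not per byte.
my $munged = reverse $text;

# Output: the :encoding layer ENCODES characters back into bytes.
my $out_bytes = '';
open( my $out, '>:encoding(UTF-8)', \$out_bytes ) or die "open: $!";
print {$out} $munged;
close $out;

printf "%d chars in, %d bytes out\n", length($text), length($out_bytes);
```

The three decoded characters become six bytes again on the way out. If the output handle has no encoding layer, as with the tied STDOUT in the original question, the same step has to be done explicitly before printing.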
Re^4: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 12, 2024 at 00:52 UTC
Well, it's clear my understanding of decoding and encoding was completely the wrong way around. I really wish I'd looked at your second paragraph more closely before spending the entire night trying to bypass the encoding completely and set the utf8 flag manually. My assumption was that if I know I have a text file with UTF-8 encoding, and perl uses utf-8 internally when told to, it should be as simple as doing a sysread into a scalar and flagging it as utf8. So I created the file test_utf8, set its contents to

this has wide utf8 chars like ❇ (snowflake)

and tested with the following SSCCE.

`use utf8; sub fileread; # use do { local $/; <$fh> } my $file = 'test_utf8'; # Test 1 binmode STDOUT, ':encoding(UTF-8)'; my $line = fileread $file,':raw'; utf8::decode($line); if ($line =~ /(❇)/) { print "found '$1'\n"; } print $line; sub fileread { my ($file,$enc) = @_; my $string; my $stref = \$string; open(my $fh, "< $enc", $file) || die "Can't open $file: $!"; ${$stref} = do { local $/; <$fh> }; return $string; }`

This prints

found '❇'
this has wide utf8 chars like ❇ (snowflake)

which is the desired behaviour, but other tests produce more puzzling results. For example, using `fileread $file,':encoding(UTF-8)';` or `fileread $file,':encoding(ISO-8859-1)'` produced identical results, but the following test

`my $line = fileread $file,':encoding(UTF-8)'; $line = Encode::decode('UTF-8', $line, 'Encode::FB_CROAK');`

was (I'm sure) producing `this has wide utf8 chars like âť (snowflake)` a few hours ago but is now crashing the script, giving `Undefined subroutine &Encode::decode called at - line 18.` if binmode is commented out and `Wide character at - line 18.` if it isn't. Maybe it was `utf8::encode` giving me the first line; things are getting kinda hazy at this point. It does produce the correct result when used with `fileread $file,':raw'` or `fileread $file,':encoding(ISO-8859-1)'`.
Interestingly, `unicode_strings` made no difference to the regex succeeding or failing in any of my tests, and `utf8::upgrade/downgrade` don't appear to do anything at all in this SSCCE. It would be nice to conclude that when in doubt just use `utf8::decode`, but I've also been testing with `Net::Async::FastCGI`, which also gives me a tied STDOUT, only it does UTF-8 encoding on it, which I need to turn off with `set_encoding( undef );` if I do that.

PS: I notice all the occurrences of ❇ in my code blocks have been turned into `&#10055;`, so it's some small comfort that perlmonks.org can't quite get a grip on this either. 😜
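The puzzling results in this post are consistent with text being decoded twice. The sketch below is a hedged illustration, not a diagnosis of the poster's script. It loads `Encode` explicitly so that `Encode::decode` is defined, passes `FB_CROAK` as a constant rather than as a quoted string, and shows that decoding already-decoded text fails. The sample bytes and variable names are assumptions.

```perl
use strict;
use warnings;
use Encode qw(decode);    # loading Encode explicitly defines Encode::decode

my $bytes = "\xE2\x9D\x87";    # the three UTF-8 bytes of U+2747

# LEAVE_SRC keeps decode() from modifying $bytes while checking is enabled.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# First decode: three bytes become one character.
my $chars = decode( 'UTF-8', $bytes, $check );

# Second decode of already-decoded text: expected to die, because a
# character above 0xFF cannot be treated as a byte.
my $again = eval { decode( 'UTF-8', $chars, $check ) };
my $error = $@;

printf "bytes=%d chars=%d second decode %s\n",
    length($bytes), length($chars), ( $error ? 'died' : 'succeeded' );
```

The exact wording of the failure differs between versions of `Encode`, so the sketch only checks that the second call dies. The practical rule is to decode once, either with an `:encoding` layer on the input handle or with an explicit call on `:raw` data, but not both.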
Re^5: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 12, 2024 at 11:34 UTC
Re^6: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 14, 2024 at 03:01 UTC
Re^5: FCGI, tied handles and wide characters (snowflake obfus and emojis) by eyepopslikeamosquito (Archbishop) on Sep 13, 2024 at 09:38 UTC