FCGI, tied handles and wide characters

Maelstrom has asked for the wisdom of the Perl Monks concerning the following question:

I thought my knowledge of Unicode handling in Perl had evolved to the point where I could include a Unicode character in my template files, but I thought very wrong. I've been finding the message "Use of wide characters in FCGI::Stream::PRINT is deprecated and will stop working in a future version of FCGI" in my server logs, and no amount of setting STDOUT to utf8 encoding through various pragmas or binmode invocations has been able to remove it. Extensive googling turned up the possibility that FCGI has no concept of Unicode handling, and the most thorough explanation I could find was this rant at https://www.apachelounge.com/viewtopic.php?p=37483 which posits that the use of tied handles makes changing STDOUT ineffective and that the only solution is to roll your own print routine like..

sub my_print($)
{
        my($string) = @_;
        my $is_utf8 = is_utf8(${$string});
        _utf8_off(${$string}) if($is_utf8);
        print ${$string};
        _utf8_on(${$string}) if($is_utf8);
}

I could already hear the lynch mobs baying about the scandalous use of _utf8_off. While at this point worrying about making Unicode handling even more broken seemed like an academic distinction, even I was troubled about turning off the utf8 flag on a string that I knew actually was utf8. Luckily I only had to get halfway through that doomed solution, as it turns out my template module will take a file handle to use instead of STDOUT, and I could print to a string to get it flagged as UTF like so...

my $string;
open(my $out, "+<", \$string);
binmode $out, ':encoding(UTF-8)';
$page->output($out);
close($out);
print $string;

Without the binmode on the string's file handle I was getting the generic wide character warnings from the perl interpreter and nothing from FCGI, but with it everything is now working. I came here to ask WTF is going on, but in the course of writing out the question I think I actually figured it out. I'm going to post anyway in case it helps the next person trying to find an answer on Google. In fact I'm now thinking that use v5.14; in my template module instead of just the main script would also have solved this problem, although it feels counter-intuitive that Unicode handling in my module isn't inherited from the main script. If I have to find a question, I guess maybe it's turned into: should I just use feature 'unicode_strings' (or use v5.14) in all libraries going forward, or does this just cause the same problem for future users calling the modules from a script that doesn't have that? Ideally I would set a configuration switch to turn it on or off but I don't think use pragmas can be set like that.


Re: FCGI, tied handles and wide characters by ikegami (Patriarch) on Sep 09, 2024 at 11:06 UTC
STDOUT isn't really a file handle in your case. It's a tied object that presents the interface of a handle, but isn't actually one. And layers (such as `:encoding(UTF-8)`) aren't supported by tied handles. So, rather than relying on an encoding layer, encode explicitly.

"I could already hear the lynch mobs baying about the scandalous use of _utf8_off"

And rightly so, since you're effectively encoding the scalar using utf8 when `is_utf8` is true, but you fail to do so when `is_utf8` is false.

sub my_print($)
{
        my($string) = @_;
        my $is_utf8 = is_utf8(${$string});
        _utf8_off(${$string}) if($is_utf8);
        print ${$string};
        _utf8_on(${$string}) if($is_utf8);
}
my_print( \$string );

should be

sub my_print($)
{
        my $string_ref = shift;
        my $string = $$string_ref;
        utf8::encode( $string );
        print $string;
}
my_print( \$string );

Better yet,

sub my_print
{
        my $s = join( $,, @_ ) . $\;
        utf8::encode( $s );
        print( $s );
}
my_print( $string );
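
For anyone who wants to see the tied-handle behaviour in isolation, here is a minimal sketch. `ByteOnlyHandle` is a hypothetical stand-in written for this example, not FCGI's actual implementation; it dies on wide characters (rather than warning) only to make the failure obvious. `print_utf8` is likewise an illustrative name. The sketch shows that `binmode` adds no layer to a tied handle, and that encoding a copy explicitly is what makes the print succeed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tied stream such as FCGI's: it accepts bytes only.
package ByteOnlyHandle;

sub TIEHANDLE { my ($class) = @_; return bless { buf => '' }, $class }

sub PRINT {
    my ($self, @items) = @_;
    for my $item (@items) {
        # A character above 0xFF cannot be a single byte.
        die "wide character passed to PRINT\n" if $item =~ /[^\x00-\xFF]/;
        $self->{buf} .= $item;
    }
    return 1;
}

sub BINMODE { return 1 }    # the call is accepted, but no layer is applied

package main;

# Encode a copy of the text to UTF-8 bytes, then print that copy.
sub print_utf8 {
    my ($fh, @items) = @_;
    my $s = join '', @items;
    utf8::encode($s);        # in place on our copy; the caller's string is untouched
    return print {$fh} $s;
}

tie *OUT, 'ByteOnlyHandle';
binmode OUT, ':encoding(UTF-8)';              # no effect on the tied handle

my $text = "snowman: \x{2603}\n";
my $ok_plain = eval { print OUT $text; 1 };   # fails: the string is still characters
print_utf8(\*OUT, $text);                     # succeeds: the handle sees bytes only

my $bytes = tied(*OUT)->{buf};                # what the handle actually received
```

Because `print_utf8` works on a copy, `$text` is still a character string afterwards and can be processed further.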
Re^2: FCGI, tied handles and wide characters by cavac (Prior) on Sep 09, 2024 at 15:00 UTC
"STDOUT isn't really a file handle in your case. It's a tied object that presents the interface of a handle, but isn't actually one. And layers (such as :encoding(UTF-8)) aren't supported by tied handles."

I rather suspect that a similar problem could lurk in incoming data as well. I would certainly check if incoming Umlauts, Emojis and other Unicode stuff gets decoded correctly.

In the long run, it might also pay to run some Unicode normalization to make sure the same text is always encoded the same way (especially for usernames, passwords and such). Unicode equivalence can be rather annoying sometimes, see also: incorrect length of strings with diphthongs.
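
A rough sketch of that input-side check, assuming the incoming request data is UTF-8 bytes. `decode_input` is a hypothetical helper name, and NFC is just one reasonable choice of normalization form; it uses the core `Encode` and `Unicode::Normalize` modules.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Decode raw UTF-8 bytes into characters (dying on malformed input),
# then normalize so equivalent sequences compare equal.
sub decode_input {
    my ($raw_bytes) = @_;
    my $chars = decode('UTF-8', $raw_bytes, Encode::FB_CROAK);
    return NFC($chars);
}

my $precomposed = decode_input("caf\xC3\xA9");     # "e acute" as one code point, U+00E9
my $decomposed  = decode_input("cafe\xCC\x81");    # "e" followed by combining U+0301
```

After normalization both inputs hold the same four characters, so a comparison such as a username lookup treats them as equal.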
Re^3: FCGI, tied handles and wide characters by ikegami (Patriarch) on Sep 09, 2024 at 18:48 UTC
Definitely. `FCGI::Request` replaces STDIN, STDOUT and STDERR by default.
Re: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 09, 2024 at 12:54 UTC
"Ideally I would set a configuration switch to turn it on or off but I don't think use pragmas can be set like that."

That rather depends on the specifics of "a configuration switch" but `use` can certainly be made conditional upon various things, e.g. an environment variable:

`use if $ENV{UNISTR}, 'feature' => 'unicode_strings';`

Note that I suspect that the `unicode_strings` feature is not the deciding factor in your case anyway but without an SSCCE it is hard to say for certain.

🦛
Re^2: FCGI, tied handles and wide characters by ikegami (Patriarch) on Sep 09, 2024 at 14:11 UTC
Correct. `unicode_strings` affects things that care whether characters in the U+80 to U+FF range are letters or not, or whitespace or not. This includes `uc`, `split` and the regex engine. It doesn't affect I/O.

$ t() { perl -e'
   use open ":std", ":encoding(UTF-8)";
   use if $ARGV[0], feature => "unicode_strings";
   CORE::say uc "\xE9";
' -- "$@"
}
$ t 0
é
$ t 1
É

$ t() { perl -e'
   use open ":std", ":encoding(UTF-8)";
   use if $ARGV[0], feature => "unicode_strings";
   CORE::say join "|", split " ", "a\x20b\xA0c";
' -- "$@"
}
$ t 0
a|b c
$ t 1
a|b|c
Re^2: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 09, 2024 at 14:43 UTC
That's interesting but what I meant was exporting the use pragma after the compilation stage and instead when a specific function is called, which is impossible to my knowledge. However I could probably take an option in my constructor like `Template->new(UNICODE=>1)` and modify ikegami's my_print to say `if ($self->{'UNICODE'}) { utf8::encode( $s ); }` and maybe even throw in a binmode on STDOUT in the constructor while I'm at it, although it wouldn't help with the FCGI situation.

It turns out adding the `unicode_strings` feature to my Template module didn't work after all so I guess I will need to explicitly encode at some point. Sadly my template module is about a 1000 lines past being an SSCCE but the gist of it is it slurps a file into an array and constructs a hash of arrays which are interpolated then printed based on conditionals being met. I am using `:encoding(UTF-8)` when I slurp the file so between that and `unicode_strings` it's a puzzle why that's not enough to have the text encoded and flagged as utf8.
Re^3: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 11, 2024 at 08:26 UTC
"Sadly my template module is about a 1000 lines past being an SSCCE"

The point of the SSCCE is that you strip out the 980 lines which are irrelevant to the problem at hand and just show us the self-contained, minimal script of 20 lines or less which reproduces the warning.

"I am using `:encoding(UTF-8)` when I slurp the file so ... it's a puzzle why that's not enough to have the text encoded and flagged as utf8."

If you use that when slurping your input it will be decoded (not encoded). This is so that your code can then work on it as character data. Once you have finished munging it and are ready to output it, that is the point at which it needs to be encoded (and from the warning in the OP, this is the step which is missing).

🦛
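
The decode, munge, encode cycle described above can be sketched with explicit `Encode` calls. The sample text and variable names are made up for illustration; `decode` here stands in for what an `:encoding(UTF-8)` layer does on read, and `encode` for the output step that was missing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $file_bytes = "na\xC3\xAFve \xE2\x98\x83";    # UTF-8 bytes, as stored on disk

my $chars  = decode('UTF-8', $file_bytes);       # input: bytes become characters
my $munged = uc $chars;                          # processing happens on characters
my $out    = encode('UTF-8', $munged);           # output: characters become bytes again
```

`$chars` holds 7 characters while `$out` holds 10 bytes; only `$out` is safe to hand to a handle that has no encoding layer, such as a tied one.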
Re^4: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 12, 2024 at 00:52 UTC
Re^5: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 12, 2024 at 11:34 UTC
Some notes below your chosen depth have not been shown here
Re^5: FCGI, tied handles and wide characters (snowflake obfus and emojis) by eyepopslikeamosquito (Archbishop) on Sep 13, 2024 at 09:38 UTC