Re^3: FCGI, tied handles and wide characters

Re^4: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 12, 2024 at 00:52 UTC
Well, it's clear my understanding of decoding and encoding was completely the wrong way around. I really wish I'd looked at your second paragraph more closely before spending the entire night trying to bypass the encoding completely and set the utf8 flag manually. My assumption was that if I know I have a text file with UTF-8 encoding, and perl uses UTF-8 internally when told to, it should be as simple as doing a sysread into a scalar and flagging it as utf8. So I created the file test_utf8, set its contents to `this has wide utf8 chars like ❇ (snowflake)` and tested with the following SSCCE.

    use utf8;
    sub fileread; # use do { local $/; <$fh> }
    my $file = 'test_utf8';

    # Test 1
    binmode STDOUT, ':encoding(UTF-8)';
    my $line = fileread $file,':raw';
    utf8::decode($line);
    if ($line =~ /(❇)/) { print "found '$1'\n"; }
    print $line;

    sub fileread {
      my ($file,$enc) = @_;
      my $string;
      my $stref = \$string;
      open(my $fh, "< $enc", $file) || die "Can't open $file: $!";
      ${$stref} = do { local $/; <$fh> };
      return $string;
    }

This prints

    found '❇'
    this has wide utf8 chars like ❇ (snowflake)

which is the desired behaviour, but other tests produce more puzzling results. For example, using `fileread $file,':encoding(UTF-8)';` or `fileread $file,':encoding(ISO-8859-1)'` produced identical results, but the following test

    my $line = fileread $file,':encoding(UTF-8)';
    $line = Encode::decode('UTF-8', $line, 'Encode::FB_CROAK');

was (I'm sure) producing `this has wide utf8 chars like вќ (snowflake)` a few hours ago but is now crashing the script, giving `Undefined subroutine &Encode::decode called at - line 18.` if binmode is commented out and `Wide character at - line 18.` if it isn't. Maybe it was `utf8::encode` giving me the first line; things are getting kinda hazy at this point. It does produce the correct result when used with `fileread $file,':raw'` or `fileread $file,':encoding(ISO-8859-1)'`.
Interestingly, `unicode_strings` made no difference to the regex succeeding or failing in any of my tests, and `utf8::upgrade`/`utf8::downgrade` don't appear to do anything at all in this SSCCE. It would be nice to conclude that when in doubt just use `utf8::decode`, but I've also been testing with `Net::Async::FastCGI`, which also gives me a tied STDOUT, only it does UTF-8 encoding on it, which I need to turn off with `set_encoding( undef );` if I do that.

P.S. I notice all the occurrences of ❇ in my code blocks have been turned into `&#10055;`, so it's some small comfort that perlmonks.org can't quite get a grip on this either. 😜
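The mix of results in this post may be easier to follow with the file taken out of the picture. The following is a minimal sketch, not the poster's code: it assumes the file holds the three UTF-8 bytes for ❇ and uses an in-memory byte string to stand in for the `:raw`, `:encoding(ISO-8859-1)` and `:encoding(UTF-8)` reads. All variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The three UTF-8 bytes that encode U+2747, as they would sit on disk.
my $bytes = "\xE2\x9D\x87";
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Stand-in for a ':raw' read followed by one decode: a single character.
my $chars = Encode::decode('UTF-8', $bytes, $check);

# Stand-in for an ':encoding(ISO-8859-1)' read: every byte becomes the code
# point with the same value, so a later UTF-8 decode still sees the bytes.
my $latin1     = Encode::decode('ISO-8859-1', $bytes);
my $via_latin1 = Encode::decode('UTF-8', $latin1, $check);

# Stand-in for an ':encoding(UTF-8)' read followed by a second decode: the
# input already holds a code point above 255, which cannot be a byte.
my $second_ok = eval { Encode::decode('UTF-8', $chars, $check); 1 } ? 1 : 0;

printf "decoded once: %d char(s), U+%04X\n", length $chars, ord $chars;
printf "via Latin-1:  %d char(s), U+%04X\n", length $via_latin1, ord $via_latin1;
printf "decoded twice: %s\n", $second_ok ? 'survived' : 'died';
```

Under these assumptions, decoding already-decoded text is the step that fails, which would fit the `Wide character` error reported above.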
Re^5: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 12, 2024 at 11:34 UTC
Really glad to see you now have a working SSCCE. I'm somewhat bemused by the use of a scalar ref for the string in the subroutine - can you elaborate on why that is desired or necessary?

There should be no problem with using `Encode::decode()` to do the decoding, and that is often what I use. Since you put the effort in to provide your SSCCE, here is mine in return, using this sub.

    use strict;
    use warnings;
    use utf8;
    use Encode qw/decode/; # For explicit decoding only

    binmode STDOUT, ':encoding(UTF-8)';
    my $file = 'test_utf8';

    print "Explicitly decoded:\n";
    my $encoded_text = fileread ($file, ':raw');
    my $text = decode ('UTF-8', $encoded_text, Encode::FB_CROAK);
    if ($text =~ /(❇)/) {
        print "found '$1'\n";
    }
    print $text;

    print "\n\nImplicitly decoded:\n";
    $text = fileread ($file, ':encoding(UTF-8)');
    if ($text =~ /(❇)/) {
        print "found '$1'\n";
    }
    print $text;

    sub fileread {
        my ($file,$enc) = @_;
        open my $fh, "< $enc", $file or die "Can't open $file: $!";
        local $/;
        my $string = <$fh>;
        close $fh;
        return $string;
    }

As is hopefully clear, this shows that the same result occurs whether by having the PerlIO layer perform the decoding implicitly or by performing it explicitly with `Encode::decode()` in the code (as you would need to do in order to process your FCGI parameters, for example). I have simplified `fileread()` to remove the apparently unnecessary scalar ref too.

"I notice all the occurrences of ❇ in my code blocks have been turned into `&#10055;`"

Yes, it is a known issue and is an unfortunate consequence of this site pre-dating much of Unicode handling. If you have UTF-8 characters in your source you can use `<pre>` tags as I have here. 🦛
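For the FCGI parameter case mentioned above, explicit decoding at the boundary might look roughly like this. It is only a sketch: `%raw_params` is a hypothetical stand-in for whatever byte strings the request hands over, not anything provided by FCGI itself, and it assumes the client sent UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for request parameters as they arrive: plain byte
# strings, assumed here to hold UTF-8 encoded text.
my %raw_params = (
    name => "snow \xE2\x9D\x87",
    note => "plain ascii",
);

# Decode each value exactly once, and croak on malformed input rather than
# letting bad bytes travel further into the program.
my %params;
for my $key (keys %raw_params) {
    $params{$key} = Encode::decode(
        'UTF-8', $raw_params{$key},
        Encode::FB_CROAK() | Encode::LEAVE_SRC(),
    );
}

printf "%s: %d bytes in, %d characters out\n",
    $_, length $raw_params{$_}, length $params{$_}
    for sort keys %params;
```

Decoding once on the way in and encoding once on the way out (as the `binmode STDOUT` line does in the SSCCE) keeps everything in between working on characters.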
Re^6: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 14, 2024 at 03:01 UTC
"Really glad to see you now have a working SSCCE. I'm somewhat bemused by the use of scalar ref for the string in the subroutine - can you elaborate on why that is desired or necessary?"

Like most things Perl, it's an embarrassing historical artifact: I cut'n'pasted from my usual file slurping routine, which at a sprightly 50 lines was too big to fit in the SSCCE, and didn't really think about it.

Speaking of not thinking things through, is there anything really wrong with the following SSCCE, which accomplishes my original goal of bypassing PerlIO completely and using `_utf8_on`? Apart from the fact that taint doesn't like it, it seems like it would really speed up working on files that I know are UTF-8.

    use utf8;
    use Encode qw(_utf8_on);
    my $file = 'test_utf8';
    binmode STDOUT, ':encoding(UTF-8)';
    my $line = &sysfileread($file);
    _utf8_on($line);
    if ($line =~ /(❇)/) { print "found '$1'\n"; }
    print $line;

    sub sysfileread {
      my ($file) = @_;
      my $string;
      open(my $fh, "<", $file) || die "Can't open $file for reading: $!";
      my $size_left = -s $fh;
      while( $size_left > 0 ) {
        my $read_cnt = sysread($fh, $string, $size_left, length $string);
        last unless( $read_cnt );
        $size_left -= $read_cnt;
      }
      return $string;
    }
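One thing to weigh here: the Encode documentation marks `_utf8_on` as internal and notes that it does not check whether the string is well-formed UTF-8. A hedged sketch of the difference, using `utf8::decode` (which validates and reports failure) on a complete and a truncated byte sequence; the byte strings are made up for illustration and no file is involved:

```perl
use strict;
use warnings;

my $good = "\xE2\x9D\x87";   # complete UTF-8 sequence for U+2747
my $bad  = "\xE2\x9D";       # the same sequence with its last byte missing

# utf8::decode converts in place and returns false when the bytes are not
# valid UTF-8, leaving the string untouched in that case.
my $good_ok = utf8::decode($good) ? 1 : 0;
my $bad_ok  = utf8::decode($bad)  ? 1 : 0;

printf "good: ok=%d, length=%d\n", $good_ok, length $good;
printf "bad:  ok=%d, length=%d\n", $bad_ok,  length $bad;
```

With `_utf8_on` there is no return value to test, so a truncated or corrupt file would be flagged as text without any check. Whether that risk is worth the speed gain depends on how much the files can be trusted.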
Re^7: FCGI, tied handles and wide characters by hippo (Archbishop) on Sep 16, 2024 at 13:33 UTC
Re^8: FCGI, tied handles and wide characters by Maelstrom (Beadle) on Sep 21, 2024 at 09:46 UTC
Re^7: FCGI, tied handles and wide characters by NERDVANA (Priest) on Sep 17, 2024 at 01:40 UTC
Re^5: FCGI, tied handles and wide characters (snowflake obfus and emojis) by eyepopslikeamosquito (Archbishop) on Sep 13, 2024 at 09:38 UTC
"I notice all the occurrences of ❇ in my code blocks have been turned into `&#10055;`"

For more emoji fun, I like to make my PM snowflakes colourful - for example, from ❄ (`&#10052;`) to ❄️ (`&#10052;&#65039;`) to prettify this Snow flake obfu. :-) 👁️🍾👍🦟
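For anyone curious where such entity forms come from, here is a small sketch. It assumes the site shows characters outside Latin-1 in code blocks as decimal numeric character references; `to_entities` is a made-up helper for illustration, not something the site provides.

```perl
use strict;
use warnings;

# Replace every character above Latin-1 with a decimal numeric character
# reference built from its code point.
sub to_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\xFF])/sprintf '&#%d;', ord $1/ge;
    return $text;
}

my $sparkle   = to_entities("\x{2747}");          # the character from the SSCCE
my $snowflake = to_entities("\x{2744}\x{FE0F}");  # snowflake plus variation selector

print "$sparkle\n$snowflake\n";
```

The colourful snowflake is two code points, the plain snowflake followed by the emoji variation selector U+FE0F, so it comes out as two references.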