in reply to Re^4: FCGI, tied handles and wide characters
in thread FCGI, tied handles and wide characters

Really glad to see you now have a working SSCCE. I'm somewhat bemused by the use of a scalar ref for the string in the subroutine - can you elaborate on why that is desired or necessary?

There should be no problem with using Encode::decode() to do the decoding; that is often what I use. Since you put the effort into providing your SSCCE, here is mine in return, using that sub.

use strict;
use warnings;
use utf8;

use Encode qw/decode/; # For explicit decoding only

binmode STDOUT, ':encoding(UTF-8)';

my $file = 'test_utf8';

print "Explicitly decoded:\n";
my $encoded_text = fileread ($file, ':raw');
my $text = decode ('UTF-8', $encoded_text, Encode::FB_CROAK);

if ($text =~ /(❇)/) { print "found '$1'\n"; }
print $text;

print "\n\nImplicitly decoded:\n";
$text = fileread ($file, ':encoding(UTF-8)');

if ($text =~ /(❇)/) { print "found '$1'\n"; }
print $text;

sub fileread {
  my ($file, $enc) = @_;
  open my $fh, "<$enc", $file or die "Can't open $file: $!";
  local $/;    # slurp the whole file
  my $string = <$fh>;
  close $fh;
  return $string;
}

As is hopefully clear, this shows that the result is the same whether the PerlIO layer performs the decoding implicitly or it is performed explicitly with Encode::decode() in the code (as you would need to do in order to process your FCGI parameters, for example). I have also simplified fileread() to remove the apparently unnecessary scalar ref.

I notice all the occurrences of ❇ in my code blocks have been turned into &#10055;

Yes, it is a known issue and an unfortunate consequence of this site pre-dating much of Perl's Unicode handling. If you have UTF-8 characters in your source you can use <pre> tags, as I have done here.


🦛

Replies are listed 'Best First'.
Re^6: FCGI, tied handles and wide characters
by Maelstrom (Beadle) on Sep 14, 2024 at 03:01 UTC

      Really glad to see you now have a working SSCCE. I'm somewhat bemused by the use of scalar ref for the string in the subroutine - can you elaborate on why that is desired or necessary?

    Like most things Perl, it's an embarrassing historical artifact: I cut'n'pasted from my usual file-slurping routine, which at a sprightly 50 lines was too big to fit in the SSCCE, and didn't really think about it. Speaking of not thinking things through, is there anything really wrong with the following SSCCE, which accomplishes my original goal of bypassing PerlIO completely and using _utf8_on? Apart from the fact that taint doesn't like it, it seems like it would really speed up working on files that I know are UTF-8.

    use strict;
    use warnings;
    use utf8;
    use Encode qw(_utf8_on);
    
    my $file = 'test_utf8';
    
    binmode STDOUT, ':encoding(UTF-8)';
    my $line = sysfileread($file);
    _utf8_on($line);
    
    if ($line =~ /(❇)/) { print "found '$1'\n"; }
    print $line;
    
    sub sysfileread {
      my ($file) = @_;
      my $string = '';    # initialise, so length() below doesn't warn
      open(my $fh, '<', $file) || die "Can't open $file for reading: $!";
      my $size_left = -s $fh;
      while ($size_left > 0) {
        my $read_cnt = sysread($fh, $string, $size_left, length $string);
        last unless $read_cnt;
        $size_left -= $read_cnt;
      }
      close $fh;
      return $string;
    }
    
      is there anything really wrong with the following

      It depends what you understand by "really wrong". It will run, but I would not choose to use it in production for these reasons:

      1. The _utf8_on subroutine comes with the caveat: "The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change in a future release." It would not be good if a future version suddenly broke it.
      2. The subroutine performs no validity checking on its input whatsoever. The first time it is fed non-utf8 input, it will corrupt your data (at best!).
      3. As stated, it won't run under taint mode. That should be some indication to you that it is not suitable for public use.
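      To illustrate point 2, here is a minimal sketch (the sample byte strings are invented for the demonstration): Encode::decode() with FB_CROAK rejects malformed input, while _utf8_on() just flips the flag and silently mislabels the same bytes.

```perl
use strict;
use warnings;
use Encode qw(decode _utf8_on);

my $bad = "\xFF\xFE not valid UTF-8";   # \xFF can never appear in UTF-8

# Encode::decode with FB_CROAK dies on the malformed sequence...
my $ok = eval { decode('UTF-8', $bad, Encode::FB_CROAK); 1 };
print $ok ? "decode accepted it\n" : "decode croaked: bad input caught\n";

# ...whereas _utf8_on just turns the flag on, creating a corrupt string.
my $copy = "\xFF\xFE not valid UTF-8";
_utf8_on($copy);
print "flag is on, but the bytes are still invalid UTF-8\n"
    if utf8::is_utf8($copy);
```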

      Have you benchmarked it to see how much faster it really is compared with Encode::decode()? Always benchmark before optimising.
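      Such a comparison might be sketched with the core Benchmark module along these lines (the sample string and timing budget are placeholders, not measurements):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Encode qw(decode);

# Sample UTF-8 octets to decode repeatedly
my $octets = "caf\xC3\xA9 " x 1000;

cmpthese(-2, {                       # run each for ~2 CPU-seconds
    encode_decode => sub {
        my $s = $octets;
        my $t = decode('UTF-8', $s);
    },
    utf8_decode => sub {
        my $s = $octets;
        utf8::decode($s);            # validates, decodes in place
    },
    utf8_on => sub {
        my $s = $octets;
        Encode::_utf8_on($s);        # no validation at all
    },
});
```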


      🦛

        It might've been different with a bigger file, but my benchmarking indicates it is considerably faster than Encode::decode():
                         Rate      implicit encode_decode    utf_decode       utf8_on
        implicit      34929/s            --          -31%          -60%          -60%
        encode_decode 50765/s           45%            --          -41%          -43%
        utf_decode    86322/s          147%           70%            --           -2%
        utf8_on       88314/s          153%           74%            2%            --   
        
        It was a pleasant surprise to see utf8::decode get so close, though. Although given that utf8::decode won't protect me from non-UTF-8 input either, I guess the optimal solution is
        $line = Encode::decode('UTF-8', $line) unless (utf8::decode($line));

      If you want maximum performance while still being safe, try utf8::decode. It logically decodes the string in place, but because perl uses UTF-8 internally the implementation just scans the string for UTF-8 validity and then turns on the utf8 flag if it encountered any characters above 0x7F. You get the added benefit that it *doesn't* turn on the flag if the entire string is 7-bit ASCII, which can optimise code in various places when using that string.
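      That behaviour can be sketched as follows (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

# Valid multi-byte input: returns true, flag turned on
my $bytes = "na\xC3\xAFve";            # UTF-8 octets for "naïve"
print utf8::decode($bytes) ? "valid\n" : "invalid\n";
print utf8::is_utf8($bytes) ? "flag on\n" : "flag off\n";

# Pure-ASCII input: succeeds, but the flag stays off
my $ascii = "plain ascii";
utf8::decode($ascii);
print utf8::is_utf8($ascii) ? "flag on\n" : "flag off\n";

# Malformed input: returns false, so the caller can fall back or bail out
my $bad = "\xFF\xFE";
print utf8::decode($bad) ? "valid\n" : "invalid\n";
```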

      Likewise, utf8::encode is the fastest (safe) option for output.

      Just beware that they operate in place, and may invalidate assumptions made in the rest of the code if you didn't create a copy first or discard the data afterwards.
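      A small sketch of the copy-first habit (the sample string is invented):

```perl
use strict;
use warnings;

# utf8::encode works in place, so encode a copy if the character
# string is still needed afterwards.
my $chars  = "r\x{E9}sum\x{E9}";   # 6-character string "résumé"
my $octets = $chars;               # take a copy first
utf8::encode($octets);             # $octets is now UTF-8 bytes; $chars intact

print length($chars),  "\n";       # 6 (characters)
print length($octets), "\n";       # 8 (bytes: each é encodes as two)
```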