jwkrahn has asked for the wisdom of the Perl Monks concerning the following question:

My original code:

use utf8;
use open ':std', ':utf8';
use feature 'unicode_strings';
...
for my $file ( @files ) {
    open my $FH, '<', $file or die "Cannot open '$file' because: $!";
    my $size = read $FH, my $data, -s $FH or die "Cannot read from '$file' because: $!";
    $size == length( $data ) or die "Error reading from '$file'\n";
    close $FH;
    ...

This will always result (if $data contains Unicode characters) in dying with the message "Error reading from '$file'\n".

The solution is to use this instead:

$size == $data =~ y///c or die "Error reading from '$file'\n";

Hope this helps.

Please correct me if I am wrong.

Naked blocks are fun! -- Randal L. Schwartz, Perl hacker

Replies are listed 'Best First'.
Re: Counting bytes in a Unicode document
by hippo (Archbishop) on Oct 06, 2024 at 21:10 UTC
    The solution is

    just to count the bytes, if that's really what you want.

    my $bytecount = (stat $file)[7];
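
A minimal sketch of the stat approach, assuming a throwaway temp file whose contents are chosen for illustration (the sample text and the variable names other than $bytecount are hypothetical). It only shows that element 7 of stat's return list is the size in bytes, regardless of how many characters those bytes encode:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: "café\n" is 5 characters, but 6 bytes in UTF-8
# because 'é' is encoded as the two bytes C3 A9.
my ( $tmp_fh, $file ) = tempfile( UNLINK => 1 );
binmode $tmp_fh, ':raw';
print {$tmp_fh} "caf\xC3\xA9\n";
close $tmp_fh or die "Cannot close '$file' because: $!";

# Element 7 of stat's list is the file size in bytes.
my $bytecount = ( stat $file )[7];
defined $bytecount or die "Cannot stat '$file' because: $!";

# -s asks the same question, so the two numbers should agree.
my $dash_s = -s $file;

print "stat: $bytecount bytes, -s: $dash_s bytes\n";
```

No file handle or encoding layer is involved here, so the result does not depend on use open or on the layer passed to open.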

    🦛

Re: Counting bytes in a Unicode document
by parv (Parson) on Oct 07, 2024 at 07:30 UTC
    use utf8;
    use open ':std', ':utf8';
    use feature 'unicode_strings';
    ...
    for my $file ( @files ) {
        open my $FH, '<', $file or die "Cannot open '$file' because: $!";
        my $size = read $FH, my $data, -s $FH or die "Cannot read from '$file' because: $!";
        $size == length( $data ) or die "Error reading from '$file'\n";
        close $FH;
        ...

    This will always result (if $data contains Unicode characters) in dying with the message "Error reading from '$file'\n".

    Unlike in the OP, here with perl v5.36.3 on FreeBSD 14 (with the code below) the numbers from the length and read functions match (226); they obviously do not match the size in bytes (282). Is it a matter of the perl version, or of how the encoding layer is being applied?

    perl check-size.pl ./data.mixed
    $VAR1 = {
      '-s' => 282,
      'length' => 226,
      'read' => 226,
      'tr' => 226
    };
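
The 'length' and 'tr' rows agree in that output. A small sketch of why that is expected, assuming a hypothetical seven-character sample string (not the data file used above): both length and y///c count characters, so neither yields a byte count on a decoded string.

```perl
use strict;
use warnings;

# Hypothetical sample: "naïve ☃" as a character string (7 characters).
my $data = "na\x{EF}ve \x{2603}";

my $by_length = length $data;

# An empty search list with /c matches every character, and with an
# empty replacement list y/// only counts the matches.
my $by_tr = ( $data =~ y///c );

print "length: $by_length, y///c: $by_tr\n";
```

If the two counts are the same, swapping length for y///c would not by itself change the outcome of a comparison against the value returned by read.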

    Base64-encoded generated data -- if anyone would rather have it than run above "make-data.pl" -- with SHA256 sum of a1e81919b72403bd3c9d95979dae8d928354cf64de8d999bcad00aec4298cf76 (decoded value has the sum of 02bec20353169b58ea51558fa7d6c4316bd13f5e4a3d976484e8ed7700935962) ...

    Update: *ugh*...

    • The Base64-encoded sample has a different checksum after being downloaded from here than on the machine it was posted from. The checksum for the decoded value is still the same as posted.
    • Removed the program that produced the sample Unicode output, as I had apparently changed it (while (re)editing this post) such that it produced different output than the Base64-encoded text mentioned above (for which the size result is shown).
Re: Counting bytes in a Unicode document
by LanX (Saint) on Oct 06, 2024 at 21:11 UTC
    I'm confused, do you want to count the "bytes" or the "characters" in $data?

    Update
    I've never used read, but it seems you are using -s to get the size in bytes, while read will interpret that size as a number of Unicode characters because of the utf8 layer. Is this really what you want?

    That's kind of a "creative" way to slurp a whole file...

    Anyway to answer the title's question

    • Counting bytes in a Unicode document
    Remove temporarily the utf8 flag and use length then.

    While I doubt that's what you want, others might stumble over this thread asking exactly this.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      Remove temporarily the utf8 flag and use length then.
      You can just use bytes::length (after use bytes ();).
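
A rough sketch comparing the two suggestions on a decoded string, assuming a hypothetical sample and hypothetical variable names. Note that bytes::length reports the size of Perl's internal representation, which matches the UTF-8 byte count only when the string is stored internally as UTF-8 (as it is after decoding non-ASCII input here); the bytes documentation itself discourages using the module for anything other than debugging, so encoding explicitly and taking length of the result is shown alongside it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use bytes ();    # load bytes::length without enabling byte semantics

# Hypothetical sample: the UTF-8 octets for "naïve ☃" (10 bytes).
my $octets = "na\xC3\xAFve \xE2\x98\x83";

# Decode into a character string: 7 characters.
my $text  = decode( 'UTF-8', $octets );
my $chars = length($text);

# Byte count via an explicit encode: independent of internal storage.
my $encoded_bytes = length( encode( 'UTF-8', $text ) );

# Byte count of the internal representation of the decoded string.
my $internal_bytes = bytes::length($text);

print "characters: $chars, encoded bytes: $encoded_bytes, ",
      "bytes::length: $internal_bytes\n";
```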

      --
      A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |
Re: Counting bytes in a Unicode document
by ikegami (Patriarch) on Oct 08, 2024 at 15:52 UTC

    Replacing

    open my $FH, '<', $file
    with
    open my $FH, '<:raw', $file

    will do the trick, but the aforementioned stat solution is much more efficient.
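
To illustrate the difference the layer makes, here is a sketch that reads the same file twice, once with :raw and once with a decoding layer. The temp file, its contents, and the helper name slurp_count are assumptions made for the example; the layer is passed explicitly rather than through use open.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: "naïve ☃\n" is 8 characters but 11 bytes in UTF-8.
my ( $tmp_fh, $file ) = tempfile( UNLINK => 1 );
binmode $tmp_fh, ':raw';
print {$tmp_fh} "na\xC3\xAFve \xE2\x98\x83\n";
close $tmp_fh or die "Cannot close '$file' because: $!";

# Hypothetical helper: slurp $path through $layer and return
# ( size reported by -s, value returned by read, length of the data ).
sub slurp_count {
    my ( $layer, $path ) = @_;
    open my $fh, "<$layer", $path
        or die "Cannot open '$path' because: $!";
    my $size = -s $fh;
    my $got  = read $fh, my $data, $size;
    defined $got or die "Cannot read from '$path' because: $!";
    close $fh;
    return ( $size, $got, length $data );
}

for my $layer ( ':raw', ':encoding(UTF-8)' ) {
    my ( $size, $got, $len ) = slurp_count( $layer, $file );
    print "$layer: -s=$size read=$got length=$len\n";
}
```

With :raw all three numbers are byte counts and agree. With the decoding layer, read and length count characters, so they agree with each other but not with -s whenever the file contains multi-byte characters.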

      > the aforementioned stat solution is much more efficient

      I suppose the -s in the OP's code does exactly the same.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Yes, -s is just another way to stat.

        For example, one could write the following:

        defined( my $s = -s $qfn )
           or die( "Can't stat `$qfn`: $!\n" );

        ... $s ...

        But it could also be written as follows:

        stat( $qfn )
           or die( "Can't stat `$qfn`: $!\n" );

        ... -s _ ...

        or

        defined( my $s = ( stat( $qfn ) )[ 7 ] )
           or die( "Can't stat `$qfn`: $!\n" );

        ... $s ...
Re: Counting bytes in a Unicode document
by sectokia (Friar) on Oct 15, 2024 at 04:11 UTC

    Every time I see people using UTF-8, I think they massively overcomplicate it. All you usually need to do is use decode() and encode() on your scalars.

    Example: read the bytes raw, then get the number of Unicode characters by decoding the utf-8 into a scalar of unicode characters:

    use Encode;
    open my $FH, '<:raw', $utf8file;
    read $FH, my $data, -s $FH;
    print length decode("utf8",$data);
Re: Counting bytes in a Unicode document
by ysth (Canon) on Oct 09, 2024 at 20:35 UTC

    What are you trying to accomplish with your $size == length( $data ) check?

    Verifying that the entire file was read?

    read is supposed to return the number of characters read, which should always match the length. From your description, are you getting a byte count instead?

    What perl version are you using?

    Are you using CORE::read or are you doing something to override it?



    --
    A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |