Counting bytes in a Unicode document

jwkrahn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Counting bytes in a Unicode document by hippo (Archbishop) on Oct 06, 2024 at 21:10 UTC
The solution is just to count the bytes, if that's really what you want: `my $bytecount = (stat $file)[7];` 🦛
Re: Counting bytes in a Unicode document by parv (Parson) on Oct 07, 2024 at 07:30 UTC
> `use utf8; use open ':std', ':utf8'; use feature 'unicode_strings'; ... for my $file ( @files ) { open my $FH, '<', $file or die "Cannot open '$file' because: $!"; my $size = read $FH, my $data, -s $FH or die "Cannot read from '$file' because: $!"; $size == length( $data ) or die "Error reading from '$file'\n"; close $FH; ...`
> This will always result (if $data contains Unicode characters) in dying with the message "Error reading from '$file'\n".

Unlike in the OP, here with perl v5.36.3 on FreeBSD 14 (with the code below) the numbers from the `length` and `read` functions match (226); unsurprisingly, they do not match the size in bytes (282). Is it a matter of the Perl version, or of how one is holding the encoding layer?

`perl check-size.pl ./data.mixed` prints `$VAR1 = { '-s' => 282, 'length' => 226, 'read' => 226, 'tr' => 226 };`

check-size.pl: program to count the 'size' of the file generated by 'make-data.pl' (914 bytes)

Base64-encoded generated data, for anyone who would rather have it than run the above "make-data.pl", with SHA256 sum of `a1e81919b72403bd3c9d95979dae8d928354cf64de8d999bcad00aec4298cf76` (the decoded value has the sum `02bec20353169b58ea51558fa7d6c4316bd13f5e4a3d976484e8ed7700935962`) ... base64-encoded ./data.mixed (894 bytes)

Update: ugh... The Base64-encoded sample has a different checksum after being downloaded from here than on the machine it was posted from. The checksum for the decoded value is still the same as posted. I removed the program that produced the sample Unicode output, as I had apparently changed it while (re)editing this post, so that it produced different output than the above-mentioned Base64-encoded text (for which the size result is shown).
Re: Counting bytes in a Unicode document by LanX (Saint) on Oct 06, 2024 at 21:11 UTC
I'm confused: do you want to count the "bytes" or the "characters" in $data?

Update: I've never used `read`, but it seems you are using `-s` to get the size in bytes, whereas `read` will interpret that size as a count of Unicode characters because of the utf8 layer. Is this really what you want? That's kind of a "creative" way to slurp a whole file...

Anyway, to answer the title's question, "Counting bytes in a Unicode document": temporarily remove the `utf8` flag and use `length`. While I doubt that's what you want, others might stumble over this thread asking exactly this.

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery
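The difference between the two counts can be reproduced with a minimal sketch. It assumes a throwaway temporary file with made-up contents (not the OP's data) and an explicit `:encoding(UTF-8)` layer rather than the OP's `use open` setup:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: 5 characters, 2 of which need 2 bytes each in UTF-8.
my $text = "na\x{EF}v\x{E9}";    # "naive" with two accented letters

my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "Cannot close '$file': $!";

my $bytes = -s $file;            # size on disk, in bytes

open my $in, '<:encoding(UTF-8)', $file
    or die "Cannot open '$file': $!";

# With a decoding layer on the handle, read() counts characters, not bytes.
my $got = read $in, my $data, $bytes;
defined $got or die "Cannot read from '$file': $!";
close $in;

printf "-s: %d, read: %d, length: %d\n", $bytes, $got, length($data);
```

Here `-s` reports 7 while `read` and `length` both report 5, which is the same pattern as the 282 versus 226 shown above.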
Re^2: Counting bytes in a Unicode document by ysth (Canon) on Oct 08, 2024 at 01:11 UTC
> Remove temporarily the utf8 flag and use length then.

You can just use `bytes::length` (after `use bytes ();`).

-- A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |
Re^3: Counting bytes in a Unicode document by LanX (Saint) on Oct 08, 2024 at 10:01 UTC
Indeed. The `bytes` pragma is in core, and `require bytes; bytes::length($data);` does the job.

Cheers Rolf
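As a rough illustration of what `bytes::length` measures, the sketch below compares it with `length` and with the length of an explicit UTF-8 encoding. The sample string is hypothetical. Note that `bytes::length` reports the size of Perl's internal buffer, and the `bytes` documentation discourages its use for anything other than debugging, so encoding explicitly is the more predictable route:

```perl
use strict;
use warnings;
use Encode qw(encode);
require bytes;

# Hypothetical sample: 3 characters; U+20AC (euro sign) takes 3 bytes in UTF-8.
my $data = "a\x{20AC}b";

my $chars    = length $data;                       # characters
my $internal = bytes::length($data);               # bytes in Perl's internal buffer
my $encoded  = length( encode( 'UTF-8', $data ) ); # bytes of an explicit encoding

print "chars: $chars, bytes::length: $internal, encoded: $encoded\n";
```

For this string the character count is 3 and both byte counts are 5. The two byte counts can differ for strings that Perl stores internally as one byte per character, which is one reason to prefer the explicit `encode`.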
Re^4: Counting bytes in a Unicode document by ikegami (Patriarch) on Oct 08, 2024 at 15:49 UTC
Re^5: Counting bytes in a Unicode document by LanX (Saint) on Oct 08, 2024 at 17:12 UTC
(Replies nested deeper than this are not shown.)
Re: Counting bytes in a Unicode document by ikegami (Patriarch) on Oct 08, 2024 at 15:52 UTC
Replacing `open my $FH, '<', $file` with `open my $FH, '<:raw', $file` will do the trick, but the aforementioned `stat` solution is much more efficient.
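A minimal sketch of the `:raw` variant, assuming a small temporary file created just for the demonstration (the file and its three bytes are hypothetical). With no decoding layer on the handle, `read`, `length`, and `-s` all count bytes, so a sanity check like the OP's passes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: the UTF-8 bytes of e-acute (2 bytes) plus "x".
my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out;                    # write the bytes exactly as given
print {$out} "\xC3\xA9x";
close $out or die "Cannot close '$file': $!";

my $expected = -s $file;         # size on disk, in bytes

open my $fh, '<:raw', $file or die "Cannot open '$file': $!";
my $size = read $fh, my $data, $expected;
defined $size or die "Cannot read from '$file': $!";
close $fh;

# Without a decoding layer, read() and length() both count bytes.
( $size == $expected and $size == length($data) )
    or die "Short read from '$file'\n";

print "read $size of $expected bytes\n";
```

If only the number is needed, the `stat` or `-s` call alone avoids reading the file at all.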
Re^2: Counting bytes in a Unicode document by LanX (Saint) on Oct 08, 2024 at 17:17 UTC
> the aforementioned stat solution is much more efficient

I suppose the `-s` in the OP's code does exactly the same.

Cheers Rolf
Re^3: Counting bytes in a Unicode document by ikegami (Patriarch) on Oct 09, 2024 at 00:32 UTC
Yes, `-s` is just another way to `stat`. For example, one could write the following:

``defined( my $s = -s $qfn ) or die( "Can't stat `$qfn`: $!\n" ); ... $s ...``

But it could also be written as follows:

``stat( $qfn ) or die( "Can't stat `$qfn`: $!\n" ); ... -s _ ...``

or

``defined( my $s = ( stat( $qfn ) )[ 7 ] ) or die( "Can't stat `$qfn`: $!\n" ); ... $s ...``
Re^4: Counting bytes in a Unicode document by etj (Priest) on Oct 15, 2024 at 12:39 UTC
Re^5: Counting bytes in a Unicode document by ikegami (Patriarch) on Oct 15, 2024 at 15:38 UTC
(Replies nested deeper than this are not shown.)
Re: Counting bytes in a Unicode document by sectokia (Friar) on Oct 15, 2024 at 04:11 UTC
Every time I see people using UTF-8, I think they massively overcomplicate it. All you usually need to do is use decode() and encode() on your scalars. Example: read the bytes raw, then get the number of Unicode characters by decoding the UTF-8 into a scalar of Unicode characters: `use Encode; open my $FH, '<:raw', $utf8file; read $FH, my $data, -s $FH; print length decode("utf8",$data);`
Re: Counting bytes in a Unicode document by ysth (Canon) on Oct 09, 2024 at 20:35 UTC
What are you trying to accomplish with your `$size == length( $data )` check? Verifying that the entire file was read? `read` is supposed to return the number of characters read, which should always match the length. From your description, you are getting a byte count instead? What perl version are you using? Are you using CORE::read, or are you doing something to override it?

Update