read() and string comparison

hornpipe2 has asked for the wisdom of the Perl Monks concerning the following question:

I have a question about encoding, string comparison, and read() that is best expressed through an example. Here goes.

Suppose I am reading from a "binary" file. For this example let's use an IFF file since it's pretty familiar. Part of the process of decoding IFF involves checking the "Type ID" - a 4-byte indicator of chunk type, similar to a FourCC. Example type IDs might be "FORM", "LIST" etc. (but being 32 bits long, also quickly comparable to an int32 as 0x464f524d). Reading and doing something with the Type ID might look like this:

open(my $fp, '<:raw', $filename) or die "Couldn't open $filename: $!";

read($fp, my $type_id, 4);
if ($type_id eq 'FORM') {
  ...
}

Now my question is: is this string comparison always "safe"? In Python 3 I think this would not work as written, because it compares a "string" (str) to what is technically a bytes object; the comparison is simply false unless you make an explicit conversion. In Perl, this is allowed, but I don't know if there are dangers. One option would be to "unpack" like so:

read($fp, my $buffer, 4);
my $type_id = unpack('A4', $buffer);
if ($type_id eq 'FORM') {
  ...
}

Now I've guaranteed that it is an ASCII string, but have I really gained anything? Or is this overkill?

What's the encoding of a "string" read from a filehandle opened with :raw or changed with binmode()? What about the encoding of literal strings within my Perl script?


Replies are listed 'Best First'.

Re: read() and string comparison
by jcb (Parson) on Oct 26, 2019 at 23:54 UTC

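Your first comparison is safe in the sense that it will match only if you actually read the expected FourCC. Since IFF is a binary format, you should already be setting :raw on the file, as your example does. Without it, whatever layers are in effect apply. If a decoding layer such as :utf8 or :encoding(UTF-8) is active (set explicitly, or as a default through the open pragma or PERL_UNICODE), you may have frame sync problems when bytes above 127 appear in the input, since Perl will read more than 4 octets to get 4 Unicode characters. On Windows, the default :crlf layer also rewrites CRLF pairs. read is defined to read characters: if :raw is set on the filehandle, characters are octets, but otherwise they could be decoded UTF-8. If you really want exactly 4 octets, sysread is meant for octets, but it bypasses buffered I/O, so do not mix it with read on the same handle. Confused yet? (I was confused about this for a long time.)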
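To make the frame sync problem concrete, here is a minimal sketch. The sample data and the helper name read_two_fields are made up for illustration; the script writes 8 octets to a temporary file and then reads two 4-"character" fields, once through :raw and once through :encoding(UTF-8).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data: 8 octets. The first two happen to form the
# UTF-8 encoding of U+00E9, so a decoding layer sees 7 characters.
my $data = "\xC3\xA9" . "ABCDEF";

my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                       # write the octets untouched
print {$out} $data;
close($out) or die "close failed: $!";

# Hypothetical helper: read two fields of 4 "characters" each
# through the given PerlIO layer.
sub read_two_fields {
    my ($layer) = @_;
    open(my $fh, "<$layer", $path) or die "Couldn't open $path: $!";
    my @fields;
    for (1 .. 2) {
        my $got = read($fh, my $buf, 4);
        defined $got or die "read failed: $!";
        push @fields, $buf;
    }
    close($fh);
    return @fields;
}

my @raw     = read_two_fields(':raw');
my @decoded = read_two_fields(':encoding(UTF-8)');

# Show each field as hex code points so the shift is visible.
for my $set ([raw => \@raw], [decoded => \@decoded]) {
    my ($name, $fields) = @$set;
    for my $field (@$fields) {
        printf "%-8s %d chars: %s\n", $name, length($field),
            join(' ', map { sprintf '%02X', ord } split //, $field);
    }
}
```

With :raw, each read consumes exactly 4 octets. With the decoding layer, the first read consumes 5 octets to deliver 4 characters, so the second field starts one octet late and comes up short.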

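Unlike Python, Perl has a single string type: strings are always sequences of codepoints. Internally they are stored as UTF-8 if any codepoint exceeds 255 (and sometimes even when none does); otherwise Perl's strings actually are byte arrays, just like C except that there is an explicit length. Either way, eq compares codepoints, so the internal storage does not change the result of your comparison. Note that there are some builtins (vec springs to mind) that treat strings as byte arrays even if the UTF-8 flag is set; for vec, using it on strings with codepoints above 255 is deprecated and a fatal error in newer Perls.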
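A small sketch of what this means for the comparison in the question. The variable names are illustrative, and "(c) " is just an example of a space-padded ID. It checks that eq gives the same answer after the internal representation is switched to UTF-8, shows the int32 comparison from the question using unpack 'N' (32-bit big-endian), and shows that unpack 'A4' strips trailing spaces rather than validating ASCII.

```perl
use strict;
use warnings;

# Stand-in for four octets read from a :raw filehandle.
my $from_file = "FORM";

# Force the UTF-8 internal representation on a copy. The code points
# are unchanged, so eq still matches.
my $upgraded = $from_file;
utf8::upgrade($upgraded);
my $flag_set           = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_after_upgrade = ($upgraded eq 'FORM')    ? 1 : 0;

# Integer comparison: 'N' unpacks a 32-bit big-endian unsigned integer.
my $as_int      = unpack('N', $from_file);
my $int_matches = ($as_int == 0x464f524d) ? 1 : 0;

# 'A4' strips trailing spaces and NULs; 'a4' returns the octets as-is.
my $padded   = "(c) ";
my $stripped = unpack('A4', $padded);
my $kept     = unpack('a4', $padded);

printf "UTF-8 flag set: %d, eq still matches: %d\n",
    $flag_set, $same_after_upgrade;
printf "as integer: 0x%08x, matches: %d\n", $as_int, $int_matches;
printf "A4 gives '%s', a4 gives '%s'\n", $stripped, $kept;
```

So the plain eq on the buffer is enough for a FourCC check, and 'A4' can actually lose information when an ID is padded with spaces.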
