in reply to How do I safely, portably extract one or more bytes from a string?

Any character string can also be viewed as binary data, so there isn't really a difference between the two. The catch is that in recent perls a character and a byte are no longer necessarily the same thing. That's not because bytes are somehow more binary than characters, but because the set of representable characters has been extended beyond what fits in a single byte.

Several people suggested "use bytes". However, this is usually a bad idea, since it makes the result of your operations depend on whether the string is stored internally as utf8 or not, a representation detail you should usually not care about. Observe:

#!/usr/bin/perl -lw
$a="à"; # A high latin1 character (source file saved as latin1), doesn't even need unicode
print '$a Normal substr: ', ord(substr($a,0,1));
{
    use bytes;
    print '$a Bytes substr: ', ord(substr($a,0,1));
}
$b = $a . chr(256);
chop $b;
print '$a equals $b, but $b is internally in UTF8' if $a eq $b;
print '$b Normal substr: ', ord(substr($b,0,1));
{
    use bytes;
    print '$b Bytes substr: ', ord(substr($b,0,1));
}

Giving:

$a Normal substr: 224
$a Bytes substr: 224
$a equals $b, but $b is internally in UTF8
$b Normal substr: 224
$b Bytes substr: 195

Be suspicious of code that contains "use bytes".

The real question is "where does your string come from". That will determine the proper answer.

If you were talking about a string coming from most old-fashioned sources, e.g. read from a file (opened in binmode, or with a plain open if there isn't a utf8 default), there is simply no difference between bytes and characters, and you can just use substr(), which works independently of the internal representation of the string.
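
As a minimal sketch of that first case: the scratch file and its four sample bytes below are made up for illustration, and File::Temp is only used so the example cleans up after itself. Once the handle is in binmode, every character of the string read from it is exactly one byte, so substr() and ord() are all that is needed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write four raw bytes to a scratch file (hypothetical sample data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\x00\xE0\xFF\x41";
close $out or die "close: $!";

# Read the file back in binary mode: one character of $data per byte.
open my $in, '<', $path or die "open: $!";
binmode $in;
my $data = do { local $/; <$in> };   # slurp the whole file
close $in or die "close: $!";

# Plain substr() extracts bytes; ord() turns each into its numeric value.
my $second_byte = ord substr($data, 1, 1);
my @first_three = map { ord } split //, substr($data, 0, 3);

print "$second_byte\n";     # 224
print "@first_three\n";     # 0 224 255
```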

If you were indeed talking about a string read, for example, from a file opened for utf8, and you want the byte at a certain offset in the sequence of raw bytes representing that string, maybe you should in fact have opened the file in binary mode instead...

If the string is something else that could be unicode (e.g. coming from some unicode-aware subroutine), and you want the byte at a certain offset in the utf8 representation of the string, the cleanest way is probably to use encode to get the octet string:

use Encode;
$octets = encode("utf8", $string);

Alternatively, you could first "upgrade" the string using utf8::upgrade to make sure the internal representation is UTF8, after which "use bytes" will give predictable results. But I think that's rather hacky.
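
To make the encode route concrete, here is a small sketch. The helper name utf8_byte_at is hypothetical (it is not part of Encode), and the two test strings mirror $a and $b from the demo above. The point is that the answer no longer depends on how perl happens to store the string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the byte at $offset in the utf8 encoding of $string.
sub utf8_byte_at {
    my ($string, $offset) = @_;
    my $octets = encode("utf8", $string);   # octet string: one character per byte
    return ord substr($octets, $offset, 1);
}

my $plain    = chr(0xE0);           # "à", held internally as a single byte
my $upgraded = $plain . chr(256);   # appending chr(256) forces internal UTF8
chop $upgraded;                     # same logical string as $plain again

# Both strings give the same bytes, whatever their internal representation.
print utf8_byte_at($plain, 0),    " ", utf8_byte_at($plain, 1),    "\n";   # 195 160
print utf8_byte_at($upgraded, 0), " ", utf8_byte_at($upgraded, 1), "\n";   # 195 160
```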

Finally, it is possible that the string really does contain unicode characters, but that you should stop thinking in terms of bytes and simply get the n-th character, whose codepoint value can now exceed 255. In that case plain substr() is again what you want.
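
A short sketch of that last case, using an arbitrary sample string of my own choosing: substr() counts characters, so it hands back the whole character even when its codepoint is far above 255.

```perl
use strict;
use warnings;

# Three characters: "a", U+263A (WHITE SMILING FACE), "b".
my $string = "a" . chr(0x263A) . "b";

# substr() works on characters, not bytes.
my $second = substr($string, 1, 1);

printf "U+%04X\n", ord $second;   # U+263A
print length($string), "\n";      # 3
```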

You can always view a string as a sequence of integers, where these integers can represent certain characters. Recent perls just allow some of the integers to be 256 or greater. And UTF8 is just one way (and NOT the only way) to encode this sequence of integers as a classic sequence of bytes. The byte sequence, however, isn't necessarily the same as the integer sequence (though it can be if all integers are small enough), which is ultimately why the proper answer to your question depends on which of these two sequences you want to index into at a given offset.
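
The two sequences can be put side by side with a few lines of perl. This is only an illustration with a sample string I picked; UTF-16BE is included purely to show that a different encoding of the same integers yields a different byte sequence.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "A" . chr(0xE0) . chr(0x263A);   # sample: "A", "à", U+263A

# The integer sequence: one codepoint per character.
my @integers = map { ord } split //, $string;

# Two different byte sequences encoding those same integers.
my @utf8_bytes  = unpack "C*", encode("UTF-8",    $string);
my @utf16_bytes = unpack "C*", encode("UTF-16BE", $string);

print "integers: @integers\n";      # integers: 65 224 9786
print "UTF-8:    @utf8_bytes\n";    # UTF-8:    65 195 160 226 152 186
print "UTF-16BE: @utf16_bytes\n";   # UTF-16BE: 0 65 0 224 38 58
```

Offset 1 is 224 in the integer sequence but 195 in the UTF-8 byte sequence, which is exactly the ambiguity described above.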

Re: Re: How do I safely, portably extract one or more bytes from a string?
by pg (Canon) on Nov 29, 2003 at 05:37 UTC

    The problem is that you mixed up byte context with character context. By adding just one line, your demo code can easily be fixed to show the right result:

    #!/usr/bin/perl -lw
    $a="à"; # A high latin1 character, doesn't even need unicode
    print '$a Normal substr: ', ord(substr($a,0,1));
    {
        use bytes;
        print '$a Bytes substr: ', ord(substr($a,0,1));
    }
    {
        use bytes; #I added this
        $b = $a . chr(256);
    }
    chop $b;
    print '$a equals $b, but $b is internally in UTF8' if $a eq $b;
    print '$b Normal substr: ', ord(substr($b,0,1));
    {
        use bytes;
        print '$b Bytes substr: ', ord(substr($b,0,1));
    }

    This gives:

    $a Normal substr: 224
    $a Bytes substr: 224
    $a equals $b, but $b is internally in UTF8
    $b Normal substr: 224
    $b Bytes substr: 224

    Update after reading thospel's reply:

    thospel, my point is not to argue with you about the encoding or representation. The point is that you tried to use your demo to disprove "use bytes", but it actually did the opposite and proved that "use bytes" is all right. In your case, the first bytes of $a and $b are different, and Perl did print different ord values, so it proved that "use bytes" is just fine.

    All the OP asked is how to safely get the first byte, and "use bytes" is one of the correct ways to do it. I just don't get how your big lesson on encoding is related to the original question. Reading the original post, the author does not sound to me like someone who has no idea about all the encoding stuff. My feeling is that he knows quite a lot, otherwise he would not even have asked the right question.

    Your demo of "use bytes" simply cannot be used to disprove "use bytes", and is misleading in general.

      No, I didn't mess it up, I demonstrated exactly what I wanted to demonstrate: that the same character has two different possible internal representations in perl (notice they compare eq for perl). And that these two representations give a different result for substr() under use bytes.

      You can for example use Dump() from Devel::Peek to see that internally they are different of course. But code shouldn't depend on how the string happens to be encoded internally if it can be avoided.

      Your example however leaves both $a and $b with the same internal representation (non-utf8), so of course they print the same. It also isn't related to my point anymore. Notice that the "but $b is internally in UTF8" isn't actually true for your code.

      update after reading pg's reply

      I wasn't trying to "disprove" use bytes; it obviously does what it is supposed to do. I was, however, trying to show that it makes the result depend on the internal representation. Our disagreement is first about whether that's a good idea, and second about whether that's what the OP wanted. You obviously think he wants the n-th byte of the internal representation of the string, while I assumed that if he's talking about unicode (which I'm still not sure of), he'd want the n-th byte of the UTF8 representation of the logical string.