Several people suggested "use bytes". However, this is usually a bad idea, since it makes the result of your operations depend on whether the string happens to be encoded internally as UTF-8, a representation detail you should almost never have to care about. Observe:
    #!/usr/bin/perl -lw
    $a = "à";    # A high latin1 character, doesn't even need unicode
    print '$a Normal substr: ', ord(substr($a,0,1));
    {
        use bytes;
        print '$a Bytes substr: ', ord(substr($a,0,1));
    }
    $b = $a . chr(256);
    chop $b;
    print '$a equals $b, but $b is internally in UTF8' if $a eq $b;
    print '$b Normal substr: ', ord(substr($b,0,1));
    {
        use bytes;
        print '$b Bytes substr: ', ord(substr($b,0,1));
    }

Giving:

    $a Normal substr: 224
    $a Bytes substr: 224
    $a equals $b, but $b is internally in UTF8
    $b Normal substr: 224
    $b Bytes substr: 195
Be suspicious of any code that contains "use bytes".
The real question is "where does your string come from?" The answer to that determines the proper approach.
If you were talking about a string coming from most old-fashioned sources, e.g. read from a file (opened in binmode, or with a plain open if there isn't a utf8 default), there is simply no difference between bytes and characters, and you can just use substr(), which works independently of the internal representation of the string.
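As a rough illustration of that case, here is a minimal sketch. The helper name byte_at is made up for this example, and an in-memory filehandle stands in for a real file on disk; the point is only that on data read through a binmode handle, plain substr() and ord() already work on bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: the byte value (0..255) at $offset in a string of
# raw bytes, or undef if the offset is past the end of the string.
sub byte_at {
    my ($data, $offset) = @_;
    return undef if $offset >= length $data;
    return ord substr($data, $offset, 1);
}

# Stand-in for a file on disk: an in-memory file holding four raw bytes.
my $raw = "\x00\xE0\xFF\x41";
open my $fh, '<', \$raw or die "open failed: $!";
binmode $fh;                          # no decoding layer: one byte per character
my $data = do { local $/; <$fh> };    # slurp the whole "file"
close $fh;

print byte_at($data, 1), "\n";        # 224
```

No "use bytes" is needed here: every character in $data is already a value below 256.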
If you were instead talking about a string that was, for example, read from a file opened for utf8, and you want the byte at a certain offset in the sequence of raw bytes representing that string, maybe you should in fact have opened the file in binary mode instead...
If the string is something else that could be unicode (e.g. coming from some unicode-aware subroutine), and you want the byte at a certain offset in the UTF-8 representation of the string, the cleanest way is probably to use encode to get the octet string:

    use Encode;
    $octets = encode("utf8", $string);

Alternatively, you could first "upgrade" the string using utf8::upgrade to make sure the internal representation is UTF-8, after which "use bytes" will give predictable results. But I think that's rather hacky.
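To make the encode route concrete, here is a minimal sketch. The helper name utf8_byte_at is hypothetical, and it uses the strict 'UTF-8' encoding name rather than the lax "utf8" spelling; for valid text both produce the same octets. It also checks that the answer does not change when the same string is stored differently internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the byte at $offset in the UTF-8 encoding of
# $string, or undef if the offset is past the end of the encoding.
sub utf8_byte_at {
    my ($string, $offset) = @_;
    my $octets = encode('UTF-8', $string);    # always a plain byte string
    return undef if $offset >= length $octets;
    return ord substr($octets, $offset, 1);
}

my $s = "\x{E0}";      # a-grave, stored internally as a single byte
my $t = $s;
utf8::upgrade($t);     # same string, now stored internally as UTF-8

# Same answers whatever the internal representation is.
print utf8_byte_at($s, 0), "\n";    # 195
print utf8_byte_at($t, 0), "\n";    # 195
print utf8_byte_at($s, 1), "\n";    # 160
```

Compare this with the "use bytes" demonstration earlier, where two equal strings gave 224 and 195.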
And finally there is the possibility that the string really does contain unicode characters, but that you should stop thinking in terms of bytes and just get the n-th character, which corresponds to a codepoint whose value can now simply exceed 255. In that case plain substr() is what you want again.
You can always view a string as a sequence of integers, where these integers can represent certain characters. Recent perls just allow some of the integers to be 256 or greater. And UTF-8 is just one way (and NOT the only way) to encode this sequence of integers as a classic sequence of bytes. The byte sequence, however, isn't necessarily the same as the integer sequence (though it can be if all integers are small enough). In the end that is why the proper answer to your question depends on which of these two sequences you want to index at a given offset.
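The two sequences can be put side by side with a small sketch. The sample string is just an assumption chosen for illustration: one ASCII character, one latin1 character, and one character above 255.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "a\x{E0}\x{20AC}";    # 'a', a-grave, and the euro sign

# The integer (codepoint) sequence: one entry per character.
my @codepoints = map { ord } split //, $string;

# The byte sequence of the UTF-8 encoding: one entry per octet.
my @bytes = unpack 'C*', encode('UTF-8', $string);

print "codepoints:  @codepoints\n";    # 97 224 8364
print "utf-8 bytes: @bytes\n";         # 97 195 160 226 130 172

# Plain substr() indexes the first sequence, not the second.
print ord(substr($string, 2, 1)), "\n";    # 8364
```

Offset 2 is the euro sign in the codepoint sequence but the second byte of a-grave in the UTF-8 byte sequence, which is the whole reason the question needs to say which sequence it means.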
In reply to Re: How do I safely, portably extract one or more bytes from a string?
by thospel
in thread How do I safely, portably extract one or more bytes from a string?
by Anonymous Monk