
All characters can be viewed as binary data too, so there isn't really a difference between the two. The possible problem is that with recent perls there may be a difference between characters and bytes. That's not because bytes are somehow more binary than characters, but because the number of representable characters has been extended beyond what fits in a single byte.

Several people suggested "use bytes". However, this is usually a bad idea, since it makes the result of your operations depend on whether the string happens to be stored internally as UTF-8 or not, a representation detail you should almost never have to care about. Observe:

#!/usr/bin/perl -lw

$a="\xE0"; # "à": a high latin1 character, doesn't even need unicode

print '$a Normal substr: ', ord(substr($a,0,1));
{
    use bytes;
    print '$a Bytes  substr: ', ord(substr($a,0,1));
}

$b = $a . chr(256); chop $b;
print '$a equals $b, but $b is internally in UTF8' if $a eq $b;

print '$b Normal substr: ', ord(substr($b,0,1));
{
    use bytes;
    print '$b Bytes  substr: ', ord(substr($b,0,1));
}

Giving:

$a Normal substr: 224
$a Bytes  substr: 224
$a equals $b, but $b is internally in UTF8
$b Normal substr: 224
$b Bytes  substr: 195

Be suspicious of code that uses "use bytes".

The real question is "where does your string come from". That will determine the proper answer.

If you were talking about a string coming from most old-fashioned sources, e.g. read from a file (opened in binmode, or with a plain open if there is no utf8 default), there is simply no difference between bytes and characters. You can just use substr(), which works independently of the internal representation of the string.
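As a minimal sketch of that case: the snippet below uses an in-memory filehandle as a stand-in for a real file opened with the :raw layer. The four bytes in $raw and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Four raw bytes standing in for the contents of a binary file (made up).
my $raw = "\x00\xE0\xFF\x41";

# In-memory filehandle with the :raw layer, a stand-in for
# open(my $fh, '<:raw', $path) on a real file.
open(my $fh, '<:raw', \$raw) or die "Cannot open in-memory file: $!";
my $data = do { local $/; <$fh> };   # slurp everything
close $fh;

# Every character of $data is one byte of the file, so plain substr()
# and length() already work in bytes here.
my $len       = length($data);
my $byte_at_1 = ord(substr($data, 1, 1));

print "length: $len, byte at offset 1: $byte_at_1\n";
# prints: length: 4, byte at offset 1: 224
```

No pragma is needed: as long as nothing decoded the data, characters and bytes coincide.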

If you were in fact talking about a string read from, for example, a file opened with a utf8 layer, and you want the byte at a certain offset in the sequence of raw bytes representing that string, maybe you should have opened the file in binary mode instead...

If the string is something else that could be Unicode (e.g. coming from some Unicode-aware subroutine), and you want the byte at a certain offset in the UTF-8 representation of the string, the cleanest way is probably to use encode to get the octet string:

use Encode;
$octets = encode("utf8", $string);
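Once you have the octet string, plain substr() on it indexes bytes. A small sketch, assuming a made-up two-character input (the sample string and variable names are mine, not from the question):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical input: "à" (U+00E0) followed by U+0100.
my $string = "\x{E0}\x{100}";

# Same call as above: returns the UTF-8 encoding as a byte string.
my $octets = encode("utf8", $string);

# $octets only holds values 0..255, so substr() now indexes bytes.
my @bytes      = map { ord } split //, $octets;
my $third_byte = ord(substr($octets, 2, 1));

print "bytes: @bytes\n";                   # prints: bytes: 195 160 196 128
print "byte at offset 2: $third_byte\n";   # prints: byte at offset 2: 196
```

The original $string is left untouched, so you can keep using it as a character string elsewhere.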

Alternatively, you could first "upgrade" the string using utf8::upgrade to make sure the internal representation is UTF8, after which "use bytes" will give predictable results. But I think that's rather hacky.
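For completeness, here is roughly what the upgrade route looks like. This is only a sketch to show the mechanics, not a recommendation; the variable names are mine, and it works on a copy so the original string is not touched.

```perl
use strict;
use warnings;

my $s    = "\xE0";       # "à", code point 224, stored as a single byte
my $copy = $s;
utf8::upgrade($copy);    # force the internal representation to UTF-8

my ($first, $second);
{
    use bytes;           # substr/ord now look at the internal bytes
    $first  = ord(substr($copy, 0, 1));
    $second = ord(substr($copy, 1, 1));
}

# Outside the block, the two strings still compare equal as characters.
my $same = ($s eq $copy) ? "yes" : "no";

print "internal bytes: $first $second, still equal: $same\n";
# prints: internal bytes: 195 160, still equal: yes
```

The result is predictable only because of the explicit upgrade; drop that line and you are back to depending on how the string happens to be stored.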

And finally there is the possibility that the string really does contain Unicode characters, and that you should stop thinking in terms of bytes: what you want is the n-th character, whose code point can now simply exceed 255. In that case plain substr() is what you want again.
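A tiny illustration of that last case, using a made-up three-character string whose middle character is U+263A:

```perl
use strict;
use warnings;

# Hypothetical string: "a", U+263A (WHITE SMILING FACE), "b".
my $string = "a\x{263A}b";

my $len = length($string);             # counts characters, not bytes
my $cp  = ord(substr($string, 1, 1));  # code point of the second character

printf "length %d, second character is U+%04X (%d)\n", $len, $cp, $cp;
# prints: length 3, second character is U+263A (9786)
```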

You can always view a string as a sequence of integers, where these integers can represent certain characters. Recent perls just allow some of the integers to be 256 or greater. And UTF8 is just one way (and NOT the only way) to encode this sequence of integers as a classic sequence of bytes. The byte sequence, however, isn't necessarily the same as the integer sequence (though it can be if all integers are small enough). In the end that is why the proper answer to your question depends on which of these two sequences you want to index into.
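To make the two sequences concrete, here is a sketch that prints the integer sequence of a made-up string next to two different byte encodings of it. The sample string and the choice of UTF-16BE as the second encoding are mine, picked only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical string: "A", "à" (U+00E0), U+0100.
my $string = "A\x{E0}\x{100}";

# The string as a sequence of integers (code points).
my @integers = map { ord } split //, $string;

# Two different byte sequences encoding that same integer sequence.
my @utf8  = unpack("C*", encode("UTF-8",    $string));
my @utf16 = unpack("C*", encode("UTF-16BE", $string));

print "integers: @integers\n";   # prints: integers: 65 224 256
print "UTF-8:    @utf8\n";       # prints: UTF-8:    65 195 160 196 128
print "UTF-16BE: @utf16\n";      # prints: UTF-16BE: 0 65 0 224 1 0
```

Offset 1 means 224 in the integer sequence, 195 in the UTF-8 bytes and 65 in the UTF-16BE bytes, so "the byte at offset n" has no answer until you say which sequence you mean.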

In reply to Re: How do I safely, portably extract one or more bytes from a string? by thospel
in thread How do I safely, portably extract one or more bytes from a string? by Anonymous Monk
