Re: Re: How do I safely, portably extract one or more bytes from a string?

The problem is that you mixed up byte context with character context. With just one added line, your demo code can be fixed to show the right result:

#!/usr/bin/perl -lw

$a="à"; # A high latin1 character, doesn't even need unicode

print '$a Normal substr: ', ord(substr($a,0,1));
{
    use bytes;
    print '$a Bytes  substr: ', ord(substr($a,0,1));
}
{
    use bytes; # I added this
    $b = $a . chr(256);
}
chop $b;
print '$a equals $b, but $b is internally in UTF8' if $a eq $b;

print '$b Normal substr: ', ord(substr($b,0,1));
{
    use bytes;
    print '$b Bytes  substr: ', ord(substr($b,0,1));
}

This gives:

$a Normal substr: 224
$a Bytes  substr: 224
$a equals $b, but $b is internally in UTF8
$b Normal substr: 224
$b Bytes  substr: 224

Update after reading thospel's reply:

thospel, my point is not to argue with you about encoding or representation. The point is that you tried to use your demo to disprove "use bytes", but it actually did the opposite and proved that "use bytes" is all right. In your case, the first bytes of $a and $b are different, and Perl did print different ord values, which proves that "use bytes" works just fine.
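For reference, here is a minimal sketch of the case without the added line, assuming the original demo was otherwise the same as the code above. The variable names ($x, $y, $x_byte, $y_byte) are mine, and "\xe0" is written as an escape so the result does not depend on how the source file is saved. The byte values assume an ASCII platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = "\xe0";          # "à" stored as a single Latin-1 byte
my $y = $x . chr(256);   # no "use bytes" here, so $y is upgraded to UTF-8
chop $y;                 # drops chr(256); $y keeps its UTF-8 representation

my ($x_byte, $y_byte);
{
    use bytes;           # substr and ord now see the internal bytes
    $x_byte = ord(substr($x, 0, 1));
    $y_byte = ord(substr($y, 0, 1));
}

print "x and y compare equal\n" if $x eq $y;
print "x first byte: $x_byte\n";   # 224 (0xE0)
print "y first byte: $y_byte\n";   # 195 (0xC3, first byte of the UTF-8 form)
```

The two strings compare equal, yet the first byte seen under "use bytes" differs between them.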

All the OP asked is how to safely get the first byte, and "use bytes" is one of the correct ways to do it. I just don't get how your big lesson on encoding relates to the original question. Reading the original post, the author does not sound to me like someone who has no idea about encoding. My feeling is that he knows quite a lot, otherwise he would not even have asked the right question.

Your demo of "use bytes" simply cannot be used to disprove "use bytes", and is misleading in general.

Re^3: How do I safely, portably extract one or more bytes from a string? by thospel (Hermit) on Nov 29, 2003 at 06:05 UTC
No, I didn't mess it up, I demonstrated exactly what I wanted to demonstrate: that the same character has two different possible internal representations in perl (notice they compare eq for perl). And that these two representations give a different result for substr() under use bytes.

You can for example use Dump() from Devel::Peek to see that internally they are different of course. But code shouldn't depend on how the string happens to be encoded internally if it can be avoided.

Your example however leaves both $a and $b with the same internal representation (non-utf8), so of course they print the same. It also isn't related to my point anymore. Notice that the "but $b is internally in UTF8" isn't actually true for your code.

Update after reading pg's reply:

I wasn't trying to "disprove" use bytes, it obviously does what it is supposed to do. I was however trying to show that it makes the result depend on the internal representation. Our disagreement is about first if that's a good idea and second if that's what the OP wanted. You obviously think he wants the n-th byte of the internal representation of the string, while I assumed that if he's talking about unicode (which I'm still not sure of), he'd want the n-th byte of the UTF8 representation of the logical string.
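To make the second reading concrete, here is a rough sketch of taking the n-th byte of the UTF-8 encoding of the logical string, so the answer no longer depends on the internal representation. The helper name utf8_byte_at is hypothetical, the use of Encode's encode() is my choice of tool rather than something settled in this thread, and the byte value assumes an ASCII platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: n-th byte (0-based) of the UTF-8 encoding
# of the logical string, whatever its internal representation is.
sub utf8_byte_at {
    my ($string, $n) = @_;
    my $octets = encode('UTF-8', $string);   # explicit byte string
    return ord(substr($octets, $n, 1));
}

my $x = "\xe0";          # "à", held internally as one Latin-1 byte
my $y = $x . chr(256);   # forces the UTF-8 internal representation
chop $y;                 # same logical string as $x again

printf "x internally UTF-8: %s\n", utf8::is_utf8($x) ? "yes" : "no";
printf "y internally UTF-8: %s\n", utf8::is_utf8($y) ? "yes" : "no";
printf "x first UTF-8 byte: %d\n", utf8_byte_at($x, 0);
printf "y first UTF-8 byte: %d\n", utf8_byte_at($y, 0);
```

Both calls give 195 here, because the string is encoded explicitly before any byte is taken, so neither "use bytes" nor the internal flag plays a role.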
