in reply to Re: Understanding pack and unpack changes for binary data between 5.8 and 5.10

It's a bit strange, but the internal representation of the string shouldn't* matter.

What I do find very strange is that it doesn't croak when passed non-bytes.

    use strict;
    use warnings;
    use Data::Dumper qw( Dumper );

    $Data::Dumper::Useqq  = 1;
    $Data::Dumper::Terse  = 1;
    $Data::Dumper::Indent = 0;

    my $s = chr(0xC9);

    utf8::downgrade($s);
    print(Dumper(pack('V/a*', $s)), "\n");

    utf8::upgrade($s);
    print(Dumper(pack('V/a*', $s)), "\n");

    print(Dumper(pack('V/a*', "\x{C9}\x{2660}")), "\n");

5.10.0:

    "\1\0\0\0\311"            # Ok
    "\1\0\0\0\x{c9}"          # Ok
    "\2\0\0\0\x{c9}\x{2660}"  # Does this make sense???

On the other hand, 5.8.8 was very broken:

    "\1\0\0\0\311"      # Ok
    "\1\0\0\0\303"      # XXX
    "\2\0\0\0\303\242"  # XXX

* I realize it matters all too often, but that's getting fixed. In places where it does matter, you can use utf8::upgrade and utf8::downgrade to control the internal format.
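
To illustrate the footnote, here is a minimal sketch (the variable names are only for illustration) showing that utf8::upgrade and utf8::downgrade change how a string is stored without changing its value. It uses bytes::length purely to peek at the internal storage, which is not something normal code should rely on.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

my $down = chr(0xC9);
my $up   = chr(0xC9);

utf8::downgrade($down);   # stored internally as a single byte
utf8::upgrade($up);       # stored internally as UTF-8 (two bytes)

# The two strings are still the same one-character string.
my $same_value  = ($down eq $up) ? 1 : 0;
my $same_length = (length($down) == length($up)) ? 1 : 0;

# Only the internal storage differs.
my @internal = (bytes::length($down), bytes::length($up));

print "equal: $same_value, same length: $same_length, internal bytes: @internal\n";
# prints "equal: 1, same length: 1, internal bytes: 1 2"
```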

Re^3: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by squentin (Sexton) on Mar 12, 2009 at 15:54 UTC
    The problem shows up when I call length on the return value. Of course I should have used "bytes", but as I said, the return value is a binary string, so returning a length in utf8 characters is strange.
    And what's great with this bug is that you only see it when the original string has multi-byte characters or when it is long enough. :)
        use Encode qw/_utf8_on/;
        use bytes ();   # load bytes::length without turning on byte semantics

        my $a = "bj\xc3\xb6rk";
        _utf8_on($a);
        my $binarystring = pack("V/a*", $a);
        warn length $binarystring;          # length in characters
        warn bytes::length $binarystring;   # length of the internal buffer

        my $b = "b" x 1000;
        _utf8_on($b);
        my $binarystring2 = pack("V/a*", $b);
        warn length $binarystring2;
        warn bytes::length $binarystring2;

      $a is 5 bytes long and pack("V") is 4 bytes long, so $binarystring should hold 9 bytes. length($binarystring) confirms the length, and utf8::downgrade would confirm that they are bytes.

      $b is 1000 bytes long and pack("V") is 4 bytes long, so $binarystring2 should hold 1004 bytes. length($binarystring2) confirms the length, and utf8::downgrade would confirm that they are bytes.
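
The arithmetic above can be checked with a short sketch, assuming Perl 5.10 or later (where pack works on characters). The variable names are made up for the example, and utf8::decode stands in for _utf8_on.

```perl
use strict;
use warnings;

my $name = "bj\xc3\xb6rk";   # six UTF-8 encoded bytes
utf8::decode($name);          # now five characters, all below 256

my $packed = pack('V/a*', $name);
my $chars  = length($packed);   # 4-byte count + 5 characters

# Downgrading succeeds only if every character fits in one byte.
# The second argument (1) makes it return false instead of dying.
my $is_bytes = utf8::downgrade($packed, 1) ? 1 : 0;

# The length-prefixed string can be read back unchanged.
my ($roundtrip) = unpack('V/a*', $packed);
my $intact = ($roundtrip eq $name) ? 1 : 0;

print "length: $chars, downgradable: $is_bytes, round trip ok: $intact\n";
```

On such a Perl this should report a length of 9, i.e. the same count whichever internal format the packed string happened to have.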

      And what's great with this bug is that you only see it when the original string has multi-byte characters or when it is long enough. :)

      I don't see the problem. Are you expecting something other than 9 and 1004? Yes, the length of the internal representation is different (as reported by bytes::length), but why are you mucking with the internals?

      Speaking of mucking with internals, utf8::decode should normally be used instead of _utf8_on.
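
As a rough illustration of why utf8::decode is the safer choice: it checks that the bytes really are valid UTF-8 and reports failure, whereas _utf8_on only flips the internal flag. The sample strings below are invented for the sketch.

```perl
use strict;
use warnings;

my $good = "bj\xc3\xb6rk";   # valid UTF-8 bytes for a five-character string
my $bad  = "bj\xf6rk";       # a lone Latin-1 byte 0xF6, not valid UTF-8

# utf8::decode returns true on success; on failure it leaves the
# string untouched and returns false.
my $good_ok = utf8::decode($good) ? 1 : 0;
my $bad_ok  = utf8::decode($bad)  ? 1 : 0;

print "good: decoded=$good_ok length=", length($good), "\n";
print "bad:  decoded=$bad_ok length=",  length($bad),  "\n";
```

The first string comes back as five characters; the second is rejected and keeps its original five bytes, so the caller gets a chance to handle the bad input.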

      so returning a length in utf8 characters is strange.

      It's a bit odd, but only because it's a bit inefficient.

        I needed the length of the string to write the string and its length in a binary file.

        I'm only using _utf8_on in this example. In the original code, the string already had its utf8 flag on (it was coming from gtk2, which uses utf8 everywhere), so I was expecting it to be utf8-encoded.

        I understand that my code was ambiguous because it depends on the internal representation. I wrote it a long time ago, when I didn't have much experience in Perl and didn't really know how utf8 was handled.

        But I don't think using a string in pack should result in something that depends on the internal representation of the string: the internal representation should be internal :)

        Honestly, I don't like how utf8 is handled in Perl: it tries to do everything automagically, but this makes things less clear.
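
For the use case described in this post (writing a string and its length to a binary file), one way to keep the internal representation out of the picture is to encode the text to bytes explicitly before packing. This is only a sketch under that assumption, not the original code; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $text = "bj\x{F6}rk";   # five characters

# Turn characters into UTF-8 bytes explicitly, then pack the bytes.
my $bytes  = encode('UTF-8', $text);   # six bytes
my $record = pack('V/a*', $bytes);     # 4-byte length prefix + 6 bytes

# Reading it back: unpack the bytes, then decode them to characters.
my ($payload) = unpack('V/a*', $record);
my $restored  = decode('UTF-8', $payload);
my $intact    = ($restored eq $text) ? 1 : 0;

printf "record length: %d, round trip ok: %d\n", length($record), $intact;
# prints "record length: 10, round trip ok: 1"
```

Because $record holds only bytes here, length($record) is the number of bytes that would end up in the file, provided the filehandle is opened in raw mode (for example with the '>:raw' layer).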