dkg has asked for the wisdom of the Perl Monks concerning the following question:

Hello, kind monks--

I have a smallish perl script that parses and generates binary streams (a select subset of OpenPGP/RFC 4880 packets). It needs to run cleanly in both perl 5.8 and 5.10 environments for portability, and i've run into some confusion about the changes in pack and unpack between perl 5.8 and 5.10. In particular, i occasionally need to pack and unpack raw 8-bit values (and to checksum them), and the unicode transitions have left me confused.

I've explicitly set use bytes;, but i'm not convinced that this is enough to ensure that i don't get screwed-up results when run under unexpected locales or environments. I'm looking for guidance.

perldoc -f unpack references SYSV checksums in both versions, but 5.8 shows the algorithm as:

$checksum = do {
    local $/;  # slurp!
    unpack("%32C*",<>) % 65535;
};
while 5.10 shows it as:
$checksum = do {
    local $/;  # slurp!
    unpack("%32W*",<>) % 65535;
};
Is there a way to compute this portably without explicitly checking the version number of the perl that is running? Can someone give me a concrete example of how it might break in 5.10 if i use "%32C*" instead of "%32W*"?

In a related note, when i'm un/packing literal bytes (but not checksumming), I'm currently using "C" -- should i be using something else? Do i need to be explicitly doing something to the incoming/outgoing data to force it to be treated as a binary blob instead of as a unicode string, even though i'm already declaring use bytes;?

In my research for this, i came across a post that leaves me worried about other unexpected behavior, but i confess i don't understand the issues well enough to know what the Right Thing to do is for code that needs to run correctly under both 5.8 and 5.10 and deals with raw binary data.

Any advice or pointers to specific reading would be most appreciated.

Replies are listed 'Best First'.
Re: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Patriarch) on Mar 11, 2009 at 05:32 UTC
    It really depends on how you open the file. The best way in 5.8+ is probably
    # Raw buffered handle.
    open(my $fh, '<:perlio', $qfn)
        or die("open $qfn: $!\n");

    For compatibility with 5.6, you'd use

    open(my $fh, '<', $qfn)
        or die("open $qfn: $!\n");
    binmode($fh);

    In both cases, you only get bytes, so "C" is perfectly acceptable.
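
    For instance, a minimal sketch (hypothetical data, with a 5.8-style in-memory handle standing in for a real file) showing that a binmode'd handle hands back plain octets, which "C" unpacks unchanged:

```perl
use strict;
use warnings;

# Hypothetical data: an in-memory handle stands in for a raw file.
my $data = "\x89\x50\x4E\x47";            # arbitrary binary bytes
open(my $fh, '<', \$data) or die("open: $!\n");
binmode($fh);                             # read octets, not characters

my $buf = do { local $/; <$fh> };         # slurp
my @octets = unpack("C*", $buf);
printf "%02x %02x %02x %02x\n", @octets;  # prints "89 50 4e 47"
```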

Re: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by ikegami (Patriarch) on Mar 11, 2009 at 05:55 UTC

    If this were a generic checksum routine, you could add some validation (5.8+):

    use Carp qw( croak );

    sub checksum {
        my $s = shift;
        utf8::downgrade($s, 1)
            or croak("Wide character in subroutine entry");
        return unpack("%32C*", $s) % 65535;
    }

    For earlier backwards compatibility, that would be

    use Carp qw( croak );

    sub checksum {
        my $s = shift;
        if (defined(&utf8::downgrade)) {
            utf8::downgrade($s, 1)
                or croak("Wide character in subroutine entry");
        }
        return unpack("%32C*", $s) % 65535;
    }

    Update: Switched utf8->can() (method check) for defined(&) (sub check).
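
    To make the divergence concrete, here's a sketch (assuming a perl >= 5.10): on plain byte strings "%32C*" and "%32W*" agree, but on a UTF8-flagged string holding a codepoint above 255, "C" wraps each value mod 256 (with a warning under use warnings) while "W" keeps the full codepoint. Downgrading first, as above, is what keeps "%32C*" portable.

```perl
use strict;

my $bytes = "\x41\x42";                        # plain byte string
print unpack("%32C*", $bytes) % 65535, "\n";   # 131 (0x41 + 0x42)
print unpack("%32W*", $bytes) % 65535, "\n";   # 131 (identical on bytes)

my $wide = "\x{100}";                          # utf8 flag on, codepoint 256
print unpack("%32C*", $wide) % 65535, "\n";    # 0   (256 wrapped mod 256)
print unpack("%32W*", $wide) % 65535, "\n";    # 256 (full codepoint kept)
```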

Re: Understanding pack and unpack changes for binary data between 5.8 and 5.10
by squentin (Sexton) on Mar 11, 2009 at 14:01 UTC
    I think you either have to 'use bytes', or make sure you don't use variables that have their utf8 flag set.

    I've been bitten by one of the changes in perl 5.10:
    pack('V/a*',$a) returns a value with the utf8 flag set if $a has it, unless you "use bytes". It didn't do that in perl 5.8. Am I the only one who finds this new behavior very strange? The value returned by pack('V/a*',$a) is binary; interpreting it as utf8 makes no sense :(

      ...It didn't do that in perl 5.8

      Another difference to be aware of is this:

      my $s = "\x{1234}\x{5678}";  # string with utf8 flag on
      print unpack("H*", $s), "\n";

      With 5.8, this prints a hexdump of the internal (UTF-8) representation of the string (useful, for example, when debugging encoding issues):

      e188b4e599b8

      while with 5.10, you'd get

      3478

      i.e. the low-byte values of the codepoints, with the high-byte part being truncated. With warnings enabled, you also get "Character in 'H' format wrapped in unpack at...".

      With use bytes, or when explicitly turning off the utf8 flag (update: as shown below), you get the old behaviour.  And specifically for debugging encoding issues, Devel::Peek is the recommended alternative since 5.10, because of this difference.
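
      For completeness, a sketch of the Devel::Peek route (its output goes to STDERR, and the exact layout varies between perl versions):

```perl
use Devel::Peek qw( Dump );

my $s = "\x{1234}\x{5678}";  # string with utf8 flag on
# Dump() shows the FLAGS line (including UTF8) and the internal
# PV buffer bytes, without modifying the string itself.
Dump($s);
```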

        with 5.10, you'd get [...] the low-byte values of the codepoints, with the high-byte part being truncated. With warnings enabled, you also get "Character in 'H' format wrapped in unpack at...".

        It's odd that it doesn't warn or croak with "Wide character in ...".

        If you want to dump the internal buffer,

        use Encode qw( _utf8_off );

        sub internal {
            _utf8_off( my $s = shift );
            return $s;
        }

        my $s = "\x{1234}\x{5678}";  # string with utf8 flag on
        print unpack("H*", internal($s)), "\n";

        Update: Fixed error identified in reply.

        I don't see the problem.

        use strict;
        use warnings;

        use Data::Dumper qw( Dumper );
        $Data::Dumper::Useqq  = 1;
        $Data::Dumper::Terse  = 1;
        $Data::Dumper::Indent = 0;

        my $s = chr(0xC9);

        utf8::downgrade($s);
        print(Dumper(unpack('H*', $s)), "\n");

        utf8::upgrade($s);
        print(Dumper(unpack('H*', $s)), "\n");

        print(Dumper(unpack('H*', "\x{C9}\x{2660}")), "\n");

        5.10.0:

        "c9" # Ok "c9" # Ok Character in 'H' format wrapped in unpack at 750077.pl line 16. "c960" # GIGO

        The internal representation is and should be irrelevant.

        If you want to see the internal representation, it stands to reason that you should have to explicitly fetch it.

      It's a bit strange, but the internal representation of the string shouldn't* matter.

      What I do find very strange is that it doesn't croak when passed non-bytes.

      use strict;
      use warnings;

      use Data::Dumper qw( Dumper );
      $Data::Dumper::Useqq  = 1;
      $Data::Dumper::Terse  = 1;
      $Data::Dumper::Indent = 0;

      my $s = chr(0xC9);

      utf8::downgrade($s);
      print(Dumper(pack('V/a*', $s)), "\n");

      utf8::upgrade($s);
      print(Dumper(pack('V/a*', $s)), "\n");

      print(Dumper(pack('V/a*', "\x{C9}\x{2660}")), "\n");

      5.10.0:

      "\1\0\0\0\311" # Ok "\1\0\0\0\x{c9}" # Ok "\2\0\0\0\x{c9}\x{2660}" # Does this make sense???

      On the other hand, 5.8.8 was very broken:

      "\1\0\0\0\311" # Ok "\1\0\0\0\303" # XXX "\2\0\0\0\303\242" # XXX
      * I realize it matters all too often, but that's getting fixed. In places where it does matter, you can use utf8::upgrade and utf8::downgrade to control the internal format.
        The problem shows up when I take the length of the return value. Of course I should have used "bytes", but as I said, the return value is a binary string, so returning a length in utf8 characters is strange.
        And what's great about this bug is that you only see it when the original string has multi-byte characters or when it is long enough. :)
        use Encode qw/_utf8_on/;

        my $a = "bj\xc3\xb6rk";
        _utf8_on($a);
        my $binarystring = pack("V/a*", $a);
        warn length $binarystring;
        warn bytes::length $binarystring;

        my $b = "b" x 1000;
        _utf8_on($b);
        my $binarystring2 = pack("V/a*", $b);
        warn length $binarystring2;
        warn bytes::length $binarystring2;
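
        One workaround sketch (hypothetical string; uses Encode::encode_utf8): encode to UTF-8 octets yourself before packing, so the packed value never carries the utf8 flag and length() counts octets:

```perl
use strict;
use Encode qw( encode_utf8 );

my $name   = "bj\x{f6}rk";                     # character string
my $packed = pack("V/a*", encode_utf8($name)); # octets in, octets out
print length $packed, "\n";                    # 10 = 4-byte prefix + 6 UTF-8 octets
```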