mxb has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I'm a little confused regarding seeking with the x command with unpack.

This came about in this node in my thread yesterday.

In a format such as PNG, where the first field to extract is not at the start of the packed format, you seek with x before extracting. However, when the string to be unpacked is 'too short' in that it doesn't have enough bytes to seek with 'x', Perl will die rather than safely stop parsing.

I have reduced this down to the following test-case. Is this expected behaviour with x or am I attempting to approach the problem incorrectly?

Many thanks.

#!/usr/bin/env perl
use strict;
use warnings;
use 5.016;

my $str = "\x00\x00\x01\x00\x00\x02";
my @vals;

# This works - exact multiple of 'x2C'
say "length:", length($str);
@vals = unpack "(x2C)*", $str;
say join "\n", @vals;

$str .= "\x00";

# This works - not a multiple of 'SC'
say "length:", length($str);
@vals = unpack "(SC)*", $str;
say join "\n", @vals;

# This fails - not a multiple of 'x2C'
say "length:", length($str);
@vals = unpack "(x2C)*", $str;
#       ^-- dies here with: 'x' outside of string in unpack
say join "\n", @vals;

Replies are listed 'Best First'.
Re: Seeking with 'x' in unpack and out of bounds reads
by Eily (Monsignor) on Apr 27, 2018 at 10:30 UTC

    unpack, unlike regexes, allows partial matches. E.g., with $str = "abc", $str =~ /(.{2})*/; will match "ab", but unpack "(a2)*", "abc" will return "ab", "c". In the case of unpack, the second iteration of the sub-pattern "a2" is partial. Without 'x', you can tell that a match is partial in unpack because the output is incompatible with the pattern. In my example, you expected 2 characters but got one, so this has to be a partial match. However, 'x' doesn't have a direct effect on the output, so a partial match involving 'x' must be communicated through an error.
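A minimal sketch of that difference, using only core Perl. The variable names are illustrative, and the regex is anchored and wrapped in an outer capture (an assumption added here) so that the whole matched text can be printed.

```perl
use strict;
use warnings;

my $str = "abc";

# The regex only accepts whole repetitions of .{2}, so the trailing "c" is left out.
my ($matched) = $str =~ /^((?:.{2})*)/;

# unpack hands back the short final piece as well.
my @pieces = unpack "(a2)*", $str;

print "regex:  '$matched'\n";
print "unpack: ", join(", ", map { "'$_'" } @pieces), "\n";
```

The second element returned by unpack is shorter than the two characters "a2" asks for, which is how the partial match shows up in the output.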

    Note that cases like unpack "(xa)*", "abcd"; don't die because this is not a partial match: the sub-pattern "xa" matches exactly 2 times, which is a valid value for *. unpack "(ax)*", "abc"; does die because there is one full match, and then a partial match of the sub-pattern (a has matched but not x). unpack "(xa)*", "abc"; still doesn't die because, although the match is partial, it fails on 'a', which communicates the failure by returning an empty string (which is not a valid value for "a").

    This means you can always tell if there was a partial match. If the pattern matched partially and failed on a token that isn't x, you will get an output that is incompatible with the pattern (e.g. a one-character string such as '.' for a2). If the pattern matched partially because it couldn't skip a byte, it will die. In all other cases, the match was complete.
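The three cases above can be tried side by side with a small sketch. try_unpack is a hypothetical helper written for this illustration, not something from the thread; it wraps unpack in eval and reports either the values or the error message.

```perl
use strict;
use warnings;

# Hypothetical helper: run unpack and report either the values or the error.
sub try_unpack {
    my ($template, $data) = @_;
    my @vals = eval { unpack $template, $data };
    return "died: $@" if $@;
    return "ok: [" . join(", ", map { "'$_'" } @vals) . "]";
}

print try_unpack("(xa)*", "abcd"), "\n";   # two complete repetitions
print try_unpack("(xa)*", "abc"),  "\n";   # partial: 'a' has nothing left to read
print try_unpack("(ax)*", "abc"),  "\n";   # partial: 'x' has nothing left to skip
```

Note that when unpack dies, the eval yields nothing at all, so the values from the repetitions that did succeed are lost along with the failed one.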

      Hi,

      Thanks for the clear explanation. So in the circumstances where I need to seek at the start of my match, and I'm extracting as many as possible (with a (...)* group), what is the best approach?

      I could assign the results to an array within an eval {...} block to catch the error, but when I tried this it would still die and not return the successful matches.

        You could use the pattern "x4 (NN X8 N x4 /a N)*". Instead of skipping the length to get it later, you would fetch it twice and use it once. And at least in that case, the pattern doesn't fail on an x (actually, since the x4 in the parentheses skips the bytes read by the second N, you know that there is something to skip).

        Though actually, the fact that this fails is a good thing, because you know here that, for some reason, after the last chunk there are still some bytes (between 1 and 3) that let x4 skip at least once, but not four times in a row. I.e., your data is invalid. Try unpack "H*", pack "H*", <DATA>;. It looks like pack isn't very smart with the \n at the end of the string.

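        Another approach is to preprocess your string with a regex to remove the troublesome byte(s). The /s flag lets . match "\n" bytes in binary data, and \A and \z anchor to the true start and end of the string.
        $str =~ s/\A((?:...)*).{0,2}\z/$1/s;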
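The same trimming can be sketched with plain length arithmetic instead of a regex. This assumes every record has the same fixed size (3 bytes for 'x2C', as in the original test case), so it does not carry over directly to variable-length records such as PNG chunks. The variable names are illustrative.

```perl
use strict;
use warnings;

my $template    = "x2C";
my $record_size = length(pack($template, 0));   # 3 bytes: two skipped, one read
my $str         = "\x00\x00\x01\x00\x00\x02\x00";

# Keep only whole records; a trailing partial record is what makes 'x' die.
my $usable   = length($str) - length($str) % $record_size;
my $leftover = length($str) - $usable;
my @vals     = unpack "($template)*", substr($str, 0, $usable);

print "values: @vals\n";
print "ignored $leftover trailing byte(s)\n";
```

Keeping $leftover around makes it possible to warn about, rather than silently drop, data that does not fit the expected layout.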

        UPDATE: Corrected typo noticed by kcott.

        Bill
Re: Seeking with 'x' in unpack and out of bounds reads
by Anonymous Monk on Apr 27, 2018 at 16:26 UTC

    Well, pack/unpack has its own limitations. The template might be too cumbersome even if it works. A good, expressive template is usable both for pack() and unpack(). You've already gone into hacks territory with those Xx-s (to unpack a byte string not immediately preceded by its length).

    If you want to discard some value, the usual way is to

    my ($foo, undef, $bar, $baz) = unpack ...;
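As a small sketch of that idiom applied to the 6-byte string from the question: the two leading bytes of each record are read as a throwaway "a2" field rather than skipped with x2, and undef in the assignment list discards them. The template and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $str = "\x00\x00\x01\x00\x00\x02";

# Read the two leading bytes as a throwaway field instead of skipping them.
my (undef, $first, undef, $second) = unpack "(a2 C)2", $str;

print "$first $second\n";
```

Because no 'x' is involved, a short input would show up as missing or truncated values rather than as an exception.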