in reply to Seeking with 'x' in unpack and out of bounds reads

unpack, unlike a regex, allows partial matches. E.g., with $str = "abc", $str =~ /(.{2})*/; will match "ab", but unpack "(a2)*", "abc"; will return "ab", "c". In the case of unpack, the second iteration of the sub-pattern "a2" is partial. Without 'x', you can tell that a match in unpack is partial because the output is incompatible with the pattern: in my example, you expected 2 characters but got one, so this has to be a partial match. However, 'x' doesn't have a direct effect on the output, so a partial match involving 'x' must be communicated through an error.
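
A minimal sketch of that difference, using only core Perl. The variable names are mine, and I added an outer capture group to the regex so the whole matched text lands in $1:

```perl
use strict;
use warnings;

my $str = "abc";

# The regex only consumes whole 2-character repetitions.
$str =~ /((?:.{2})*)/;
my $regex_match = $1;                  # "ab"

# unpack also returns the incomplete final repetition.
my @fields = unpack "(a2)*", $str;     # ("ab", "c")

print "regex:  '$regex_match'\n";
print "unpack: '", join("', '", @fields), "'\n";
```

The one-character "c" is the sign of the partial match: no complete "a2" could have produced it.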

Note that cases like unpack "(xa)*", "abcd"; don't die, because this is not a partial match: the sub-pattern "xa" matches exactly 2 times, which is a valid count for *. unpack "(ax)*", "abc"; does die, because there is one full match and then a partial match of the sub-pattern (a has matched but not x). unpack "(xa)*", "abc"; still doesn't die: although the match is partial, it fails on 'a', which communicates the failure by returning an empty string (which is not a valid value for "a").
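
These three cases can be checked with a short script. This is only a sketch: try_unpack is a helper name I made up, and it reports whether unpack died rather than relying on the exact wording of the error message.

```perl
use strict;
use warnings;

# Hypothetical helper: run unpack inside eval and report what happened.
sub try_unpack {
    my ($template, $data) = @_;
    my @fields = eval { unpack $template, $data };
    return { died => ($@ ? 1 : 0), fields => \@fields };
}

my $complete  = try_unpack("(xa)*", "abcd");  # two full repetitions
my $fail_on_x = try_unpack("(ax)*", "abc");   # second 'x' has nothing to skip
my $fail_on_a = try_unpack("(xa)*", "abc");   # second 'a' has nothing to read

for my $case ($complete, $fail_on_x, $fail_on_a) {
    printf "died=%d fields=[%s]\n", $case->{died},
        join ",", map { "'$_'" } @{ $case->{fields} };
}
```

Note that when unpack dies, the eval yields an empty list, so the fields from the repetitions that did succeed are not available to the caller.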

This means you can always tell whether there was a partial match. If the pattern matched partially and failed on a token that isn't x, you will get output that is incompatible with the pattern (e.g. a single character for a2). If the pattern matched partially because it couldn't skip a byte, it will die. In all other cases, the match was complete.
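
As a sketch of how a caller might apply that rule, here is a hypothetical classify function (the name, arguments and return strings are all mine) for templates made of fixed-width "a" fields, optionally followed by a one-byte skip:

```perl
use strict;
use warnings;

# Hypothetical helper: classify an unpack of repeated "a$width" fields,
# optionally followed by a one-byte skip ("x") in each repetition.
sub classify {
    my ($data, $width, $skip) = @_;
    my $template = "(a$width" . ($skip ? " x" : "") . ")*";
    my @fields = eval { unpack $template, $data };
    return "partial (died on x)"   if $@;
    return "partial (short field)" if grep { length($_) != $width } @fields;
    return "complete";
}

print classify("abcd", 2, 0), "\n";   # (a2)*   two full fields
print classify("abc",  2, 0), "\n";   # (a2)*   last field is short
print classify("abc",  1, 1), "\n";   # (a1 x)* second x has nothing to skip
```

The length test only works because "a$width" has a fixed expected size. Other template letters would need their own notion of an incompatible value.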

Re^2: Seeking with 'x' in unpack and out of bounds reads
by mxb (Pilgrim) on Apr 27, 2018 at 12:31 UTC

    Hi,

    Thanks for the clear explanation. So in the circumstances where I need to seek at the start of my match, and I'm extracting as many as possible (with a (...)* group), what is the best approach?

    I could assign the results to an array within an eval {...} block to catch the error, but when I tried this it would still die and not return the successful matches.

      You could use the pattern "x4 (NN X8 N x4 /a N)*". Instead of skipping the length to get it later, you would fetch it twice and use it once. And at least in that case, the pattern doesn't fail on an x (actually, since the x4 in the parentheses skips the bytes read by the second N, you know that there is something to skip).

      Though actually, the fact that this fails is a good thing, because it tells you that, for some reason, there are still some bytes (between 1 and 3) after the last chunk that let x4 skip at least once, but not four times in a row. I.e., your data is invalid. Try unpack "H*", pack "H*", <DATA>; and look at the result. It looks like pack isn't very smart with the \n at the end of the string.

        If a rogue chunk is e.g. 7 bytes long, then unpacking with the proposed template will die on "X", so wrapping it in eval is required anyway if the data are unreliable.

        However, 'x' doesn't have a direct effect on the output, so a partial match involving 'x' must be communicated through an error.

        Looks to me like an attempt to whitewash Perl's inconsistent behaviour :-). By similar reasoning, failure to unpack e.g. Pascal strings (as in "unpack 'C/a', qq(\03ab)") should be fatal, I think.
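
        For reference, a quick check of that Pascal-string case (a sketch; the variable names are mine). The count byte promises 3 bytes, only 2 follow, and unpack returns the short string instead of dying:

        ```perl
        use strict;
        use warnings;

        # Count byte says 3, but only 2 bytes of payload follow.
        my $short  = "\03ab";
        my @fields = eval { unpack 'C/a', $short };
        my $died   = $@ ? 1 : 0;

        print "died=$died fields=(@fields)\n";
        ```

        So here the only hint of truncation is that the returned string is shorter than the count that was read.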

        Side-note: PNG tags were made human-readable for a good reason, so perhaps "A4" instead of "N" (or "L") will serve better. E.g., if data are super-reliable (CRC sums to be ignored), then chunks can be read into a hash:

        my ( $head, %chunks ) = unpack 'a8 (x4 A4 X8 N x4 /a x4)*', $input;
        say for keys %chunks;

      Another approach is to preprocess your string with a regex to remove the troublesome trailing byte(s), so that its length is an exact multiple of the record size (3 in this example).
      $str =~ s/^((?:...)+).{0,2}\z/$1/s;
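
      A small sketch of this trimming step on a made-up 8-byte string, followed by an unpack of the cleaned data. The /s modifier and the \z anchor are there so that newline bytes in binary data are treated like any other byte; the record size of 3 and the variable names are assumptions for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical input: two complete 3-byte records plus 2 stray bytes.
      my $str = "abcdefgh";

      # Keep the longest prefix whose length is a multiple of 3.
      # /s lets '.' match "\n"; \z anchors at the true end of the string.
      (my $trimmed = $str) =~ s/^((?:...)+).{0,2}\z/$1/s;

      my @records = unpack "(a3)*", $trimmed;   # only whole records remain
      print "@records\n";
      ```

      A string shorter than one record does not match at all and is left unchanged, so that case would still need its own check.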

      UPDATE: Corrected typo noticed by kcott.

      Bill