mxb has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I'm onto move advanced regular expressions now and I'm playing with (?{...})

As a learning exercise I'm attempting to parse PNG files with a regex. Again, emphasis on the learning exercise :)

After an 8 byte header, PNG files contains a sequential list of chunks. Each chunk consists of a 32-bit big endian length field, 32-bit ASCII tag, n bytes of data, where n is the value set in the length field, and a 32-bit big endian CRC. The full specification, should you require it can be found here on w3.org

I have created the following test code, which loads a PNG from __DATA__, skips the header and attempts to extract chunks with my regex.

I am successfully unpacking the length field within the (?{...})construct and extract the data to a scalar. I am also attempting to move the position with the regex by assigning to pos(). While the assignment seems to work, as in there are no errors, it seems to have no affect. I am extracting the first four bytes of the data with the CRC value.

Is it not possible to move the position within the regex within a (?{...}) block? I originally thought to extract the data with a (?<data>.{$len}) construct, where $len is set within the (?{...}) but this also didn't work. I assume this is because the value of $len is set during regex compilation? (which would be undef)

I've also attempted to make a version which uses the (??{...}) construct, but this is very new to me. This seems simpler as you can unpack the length field and return a string which matches that many bytes, but it has an odd issue with matching the last chunk. Not too sure if I should include it within this node, but advice for why this doesn't work is also much appreciated.

Many thanks

(?{...}) version:

#!/usr/bin/env perl use strict; use warnings; use 5.016; # Parsing vars my ($data); # Rebuild PNG image local $/; my $input = <DATA>; chomp $input; $input = pack "H*", $input; # Skip PNG Header $input = substr( $input, 8 ); while ($input =~ /# four byte length (?<len>....) # four byte tag (?<tag>....) # variable length data (?{ # unpack length my $len = unpack "N", $+{len}; # extract data $data = substr($_, pos(), $len); # update position pos() += $len}) # four byte crc (?<crc>....)/mgx) { say "Chunk len:", unpack "N", $+{len}; say "Chunk tag:", $+{tag}; say "Chunk data:", unpack "H*", $data; say "Chunk data len:", length $data; say "Chunk CRC: ", unpack "H*", $+{crc}; say "---"; } __END__ __DATA__ 89504e470d0a1a0a0000000d4948445200000010000000100803000000282d0f530000 +015c504c544547704c4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb5 +4f7eb54f7eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb34f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb44f7 +eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb5 +4f7eb54f7eb54f7eb54f7eb34f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7 +eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb4 +4f7eb54f7eb54f7eb5a29b27fc0000007374524e5300242d75280b02fa8bae61fdf7f +bf5e266d109dd6462faf9fc6b0dde6571b3d9b0031f01e0cd83d64505928fc51eb405 +582a426d052f12e5840e63ecea046ab248d0f6327c3e0c4ff311b67960822ef295f8a +093af76e3683c3d7213eb02192cbb8106f190db5ad42cc19ba7504c36150ec28de169 +2647000000cb4944415418d3636040005b264b0614e0e7e50c24f595617ccf201e5f3 +706062e16289f8d39b0d83fc09521980bc237737011e6f370b260080fb11390111195 +55b0b6f7f671b7d2658808e3e665176397282e2e766404298de229860173539040226 +f71b18db43848401eac222db9983d9445925385bb584e0a2490915a1c13c9c0c0aac1 +5fccc10612c8e1e74b8806d28682eacc60fb8b84d3536281b4910ea731588043303b3 +30e481ba8ea690b8104720b05b2e2813413879a96124840282fbf2009c4d034615504 +520015a424be37c5c7be0000000049454e44ae426082

(??{...}) version:

#!/usr/bin/env perl use strict; use warnings; use 5.016; our $VERSION = 0.1; # Rebuild PNG image local $/; my $input = <DATA>; chomp $input; $input = pack "H*", $input; # Skip PNG Header $input = substr( $input, 8 ); while ( my ( $len, $tag, $data, $crc ) = $input =~ / # four byte length (....) # four byte tag (....) # variable length data (??{ # unpack length my $len = unpack "N", $1; print STDERR "$2 $len\n"; return "(.){$len}"}) # four byte crc (....)/gx ) { say "Chunk len:", unpack "N", $len; say "Chunk tag:", $tag; say "Chunk data:", unpack "H*", $data; say "Chunk data len:", length $data; say "Chunk CRC: ", unpack "H*", $crc; } __DATA__ 89504e470d0a1a0a0000000d4948445200000010000000100803000000282d0f530000 +015c504c544547704c4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb5 +4f7eb54f7eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb34f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb44f7 +eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb5 +4f7eb54f7eb54f7eb54f7eb34f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7 +eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb4 +4f7eb54f7eb54f7eb5a29b27fc0000007374524e5300242d75280b02fa8bae61fdf7f +bf5e266d109dd6462faf9fc6b0dde6571b3d9b0031f01e0cd83d64505928fc51eb405 +582a426d052f12e5840e63ecea046ab248d0f6327c3e0c4ff311b67960822ef295f8a +093af76e3683c3d7213eb02192cbb8106f190db5ad42cc19ba7504c36150ec28de169 +2647000000cb4944415418d3636040005b264b0614e0e7e50c24f595617ccf201e5f3 +706062e16289f8d39b0d83fc09521980bc237737011e6f370b260080fb11390111195 +55b0b6f7f671b7d2658808e3e665176397282e2e766404298de229860173539040226 +f71b18db43848401eac222db9983d9445925385bb584e0a2490915a1c13c9c0c0aac1 +5fccc10612c8e1e74b8806d28682eacc60fb8b84d3536281b4910ea731588043303b3 +30e481ba8ea690b8104720b05b2e2813413879a96124840282fbf2009c4d034615504 +520015a424be37c5c7be0000000049454e44ae426082

output:

IHDR 13 PLTE 348 tRNS 115 IDAT 203 DAT 52041 Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/ +(.){ <-- HERE 52041}/ at regex2.pl line 17, <DATA> chunk 1.

To me, this looks like the IDAT chunk match is failing so it's backtracking and seeking forwards

Replies are listed 'Best First'.
Re: Modifying pos() from within (?{...})
by haukex (Archbishop) on Apr 26, 2018 at 11:56 UTC
    I am also attempting to move the position with the regex by assigning to pos(). While the assignment seems to work, as in there are no errors, it seems to have no affect.

    Yes, this is documented in pos:

    pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset, and so will also influence the \G zero-width assertion in regular expressions. Both of these effects take place for the next match, so you can't affect the position with pos during the current match, such as in (?{pos() = 5}) or s//pos() = 5/e.

    As for the general approach, I think you should probably anchor the match to the end of the previous one with \G, use the /gc modifiers, and check for whether the end of string was reached:

    my $string = " foo bar quz "; pos($string)=undef; # just to play it safe while ( $string =~ / \G \s* (foo|bar|quz) \s* /xgc) { print "<$1>\n"; } die "match failed at pos ".pos($string) unless pos($string)==length($string);

    Although of course a regex is probably not the right tool here, unpack is probably better.

      Thanks for this, it's clear and explains why the (?{...}) solution didn't (and could never) work.

      I've not used \G and /c before. From reading perlre I now understand that \G essentially 'anchors' that point to index pos() into the string. I'm a little confused by the description for the /c modifier. From perlre v5.24:

      c - keep the current position during repeated matching

      and from the referenced perlretut

      A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the "//c", as in "/regexp/gc".

      I assume then, should the data I'm parsing with the regex be well formatted, then pos() will never be reset and so the /c modifier is not needed? But were any matches fail, things would break

      Is this a correct understanding of \G and /c?

        I assume then, should the data I'm parsing with the regex be well formatted, then pos() will never be reset and so the /c modifier is not needed?

        Note that the condition on the while loop is the regex, so the loop will run while the match is true, so the last match executed will always be a failed one. The question then is why it failed, because it reached the end of string (= successful overall parse) or it did not, which is why I compare pos to length - but for pos to be available there, I need /c.

        Another use for /c is described in "\G assertion" in perlop (under "Regexp Quote-Like Operators"): basically, attempting to apply multiple different regexes at the same point in a string until you find one that matches.

Re: Modifying pos() from within (?{...})
by tybalt89 (Monsignor) on Apr 26, 2018 at 12:46 UTC

    A few tweaks.

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1213595 use strict; use warnings; use 5.016; our $VERSION = 0.1; # Rebuild PNG image local $/; my $input = <DATA>; chomp $input; $input = pack "H*", $input; # Skip PNG Header $input = substr( $input, 8 ); while ( $input =~ /\G # four byte length (....) # four byte tag (....) # variable length data ((??{ # unpack length my $length = unpack "N", $1; print STDERR "$2 $length\n"; ".{$length}"})) # four byte crc (....)/gcsx ) { my ( $len, $tag, $data, $crc ) = ($1, $2, $3, $4); say "Chunk len:", unpack "N", $len; say "Chunk tag:", $tag; say "Chunk data:", unpack "H*", $data; say "Chunk data len:", length $data; say "Chunk CRC: ", unpack "H*", $crc; } __DATA__ 89504e470d0a1a0a0000000d4948445200000010000000100803000000282d0f530000 +015c504c544547704c4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb5 +4f7eb54f7eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb34f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb44f7 +eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb5 +4f7eb54f7eb54f7eb54f7eb34f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7 +eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb4 +4f7eb54f7eb54f7eb5a29b27fc0000007374524e5300242d75280b02fa8bae61fdf7f +bf5e266d109dd6462faf9fc6b0dde6571b3d9b0031f01e0cd83d64505928fc51eb405 +582a426d052f12e5840e63ecea046ab248d0f6327c3e0c4ff311b67960822ef295f8a +093af76e3683c3d7213eb02192cbb8106f190db5ad42cc19ba7504c36150ec28de169 +2647000000cb4944415418d3636040005b264b0614e0e7e50c24f595617ccf201e5f3 +706062e16289f8d39b0d83fc09521980bc237737011e6f370b260080fb11390111195 +55b0b6f7f671b7d2658808e3e665176397282e2e766404298de229860173539040226 +f71b18db43848401eac222db9983d9445925385bb584e0a2490915a1c13c9c0c0aac1 +5fccc10612c8e1e74b8806d28682eacc60fb8b84d3536281b4910ea731588043303b3 +30e481ba8ea690b8104720b05b2e2813413879a96124840282fbf2009c4d034615504 +520015a424be37c5c7be0000000049454e44ae426082

      Oh! of course! I was returning "(.){$len}", which is $len number of single character capture groups. What I actually wanted was "(.{$len})" - a single capture of length $len.

      I also see you added the missing /s modifier, which explains why it was stopping mid way through.

      That makes the (??{...}) issue much clearer. Thanks!

Re: Modifying pos() from within (?{...})
by Anonymous Monk on Apr 26, 2018 at 13:42 UTC

    Ah, tybalt89 beat me to it. Some fixes are needed to your regex

    • You want to match in scalar context, otherwise the m// will return all the groups of all the matches, and the while will loop forever
    • You will most definitely want an /s modifier to your regex, otherwise the dot will not match newline characters in the data
    • The regex engine has a limitation of 32767 to the repeat count; a workaround is needed

    Here's one version:

    #!/usr/bin/env perl use strict; use warnings; use 5.016; our $VERSION = 0.1; # Rebuild PNG image local $/; my $input = <DATA>; chomp $input; $input = pack "H*", $input; # Skip PNG Header $input = substr( $input, 8 ); $| = 1; while ( $input =~ m{ # four byte length (....) # four byte tag (....) # variable length data ((??{ # unpack length my $len = unpack "N", $1; print STDERR "$2 $len\n"; ".{30000}" x ($len/30000) . ".{@{[$len%30000]}}" })) # four byte crc (....) }gsx ) { my ( $len, $tag, $data, $crc ) = ($1, $2, $3, $4); say "Chunk len:", unpack "N", $len; say "Chunk tag:", $tag; say "Chunk data:", unpack "H*", $data; say "Chunk data len:", length $data; say "Chunk CRC: ", unpack "H*", $crc; } __DATA__ 89504e470d0a1a0a0000000d4948445200000010000000100803000000282d0f530000 +015c504c544547704c4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb5 +4f7eb54f7eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb34f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb44f7 +eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb5 +4f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7 +eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb44f7eb5 +4f7eb54f7eb54f7eb54f7eb34f7eb54f7eb54f7eb54f7eb54f7eb44f7eb54f7eb54f7 +eb44f7eb54f7eb54f7eb44f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb54f7eb4 +4f7eb54f7eb54f7eb5a29b27fc0000007374524e5300242d75280b02fa8bae61fdf7f +bf5e266d109dd6462faf9fc6b0dde6571b3d9b0031f01e0cd83d64505928fc51eb405 +582a426d052f12e5840e63ecea046ab248d0f6327c3e0c4ff311b67960822ef295f8a +093af76e3683c3d7213eb02192cbb8106f190db5ad42cc19ba7504c36150ec28de169 +2647000000cb4944415418d3636040005b264b0614e0e7e50c24f595617ccf201e5f3 +706062e16289f8d39b0d83fc09521980bc237737011e6f370b260080fb11390111195 +55b0b6f7f671b7d2658808e3e665176397282e2e766404298de229860173539040226 +f71b18db43848401eac222db9983d9445925385bb584e0a2490915a1c13c9c0c0aac1 +5fccc10612c8e1e74b8806d28682eacc60fb8b84d3536281b4910ea731588043303b3 +30e481ba8ea690b8104720b05b2e2813413879a96124840282fbf2009c4d034615504 +520015a424be37c5c7be0000000049454e44ae426082


    But using unpack is better for, ahem, unpacking binary formats. You could write something like:

    my ($tag, $data, $crc) = unpack q( (x4L X8Lx4/a* L)> ), $input;
    Place ( )* around the template to get all the chunks.

      Thanks for your help also!

      But using unpack is better for, ahem, unpacking binary formats. You could write something like:

      my ($tag, $data, $crc) = unpack q( (x4L X8Lx4/a* L)> ), $input;

      Place ( )* around the template to get all the chunks.

      Wow, that's concise and reasonably clear. I'm learning regex and more advanced pack|unpack at the same time so it's definitely something I need to dig into deeper.

      Armed with the pack documentation I dissected your pack string into the following:

      x4 - four null bytes (effectively skip the length fields)

      L - 32bit value (the tag)

      X8 - seek backwards 8 bytes (back to the start of the chunk

      L - 32bit value (data length)

      x4 - four null bytes (skip the tag field)

      / - pops the last value (data length) off the extraction stack and uses it as a repetition count for the next item

      a* - binary data (count is set by the / above); not sure why * follows it, it works when removed

      L - 32 bit CRC

      It seems that I need to go through perlpacktut next!

      I do have a small question however, when I try to repeat the group with * I get the following error:

      'x' outside of string in unpack

      Maybe this is something to do with the x at the start of the group? It's trying to read a new group starting just past the end of the string?

      edit:Changing the first part of the pack string to L seemed to fix the 'outside of string' error. The data length is now explicitly within the returned values, but could be ignored later.

      edit2:I span this out into a separate thread as it was getting too far off the original topic.

        I already answered in the second thread, that the $input has an extra \n at the end, and I just realized why. The funny thing is, I reached that conclusion for the wrong reason. When I did my test I didn't modify $/ and didn't call chomp, so the \n wasn't removed. You, on the other hand did have chomp, but with $/ set to undef. The chomp doc states:

        When in slurp mode ($/ = undef ) or fixed-length record mode ($/ is a reference to an integer or the like; see perlvar), chomp won't remove anything.

        So in my tests by forgetting to call chomp, I got the same result you had by disabling its effect, and therefore reached the correct conclusion :D

      Oops, that * in the unpack template was redundant.

      my @chunks = unpack '( (x4L X8Lx4/a L)> )*', $input; while ( my ($tag, $data, $crc) = splice(@chunks, 0, 3) ) { ... }