diotalevi has asked for the wisdom of the Perl Monks concerning the following question:

I'm working with a data format where the file consists of contiguous binary records. The first two bytes are a packed integer and contain the record length inclusive of the packed integer. This means that for 142 bytes of data the record length is 144 (to include the leading two bytes as well).
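To make the layout concrete, here is a minimal sketch of how one such record could be built. The helper name `make_record` is hypothetical, and the big-endian `n` template is an assumption based on the working loop and the hex dump shown further down.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a payload in a record whose 2-byte big-endian
# length prefix counts the prefix itself as well as the payload.
sub make_record {
    my ($payload) = @_;
    return pack('n', length($payload) + 2) . $payload;
}

# 142 bytes of data produce a 144-byte record whose prefix also reads 144.
my $record     = make_record('x' x 142);
my $prefix_len = unpack('n', $record);
printf "%d bytes total, prefix says %d\n", length($record), $prefix_len;
```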

I figured out that I could use the single unpack format '(n/a)*' if the record length didn't include those two bytes. I tried modifying that format to read '(n/a XX)*' to get unpack to step back two bytes so as to align correctly. That doesn't parse as valid perl. I ended up writing a while() loop to get the job done but I'm left wondering if there was a way to word that unpack format so it would work correctly. Any ideas?

# Sample working while() loop
$sourceMetacodeLen = length $sourceMetacode;
$sourceMetacodePos = 0;
while ($sourceMetacodePos < $sourceMetacodeLen) {
    $recordLen = unpack( 'n', substr $sourceMetacode, $sourceMetacodePos, 2);
    $record = substr( $sourceMetacode, $sourceMetacodePos + 2, $recordLen - 2);
    # not relevant to the example
    # $parse .= ${translate_record(\$record, $sourceMetacodePos, \@fonts)};
    $sourceMetacodePos += $recordLen;
}

And here is an example of what the data looks like (after adding some newlines)

$ dd if=3012034-1.met bs=1 count=144|vis
\^@\M^P\^@\^@+$DJDE$   FONTS=(UN104B,HE18BP,HE06NP,HE08OP,HE08NP,HE09BP,
HE10BP,HE14BP,HE10VP,HE12NP,BLANKP,FORMSX,C395L ,HE12BP,HE36SP,HE08BP,HE
10NP),; \^A

$ od -x 3012034-1.met | head
0000000     9000    0000    242b    4a44    4544    2024    2020    4f46
0000020     544e    3d53    5528    314e    3430    2c42    4548    3831
0000040     5042    482c    3045    4e36    2c50    4548    3830    504f
0000060     482c    3045    4e38    2c50    4548    3930    5042    482c
0000100     3145    4230    2c50    4548    3431    5042    482c    3145
0000120     5630    2c50    4548    3231    504e    422c    414c    4b4e
0000140     2c50    4f46    4d52    5853    432c    3933    4c35    2c20
0000160     4548    3231    5042    482c    3345    5336    2c50    4548
0000200     3830    5042    482c    3145    4e30    2950    3b2c    0120
$

__SIG__
printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE

Re: Unpacking fixed length records
by shenme (Priest) on Oct 06, 2002 at 22:46 UTC
    I may be still in shock at grouping in unpack (new in 5.8?), but will this work for you? (Please excuse the clumsy dumping)
      $str = "\004ABC\003DE\002F";
      $fmt = "(C X /a)*";
      @z = unpack($fmt,$str);
      foreach $z (@z) {
        print "'", join("', '", map {ord} split(//,$z)),"'\n";
      }
    
    returns
    '4', '65', '66', '67'
    '3', '68', '69'
    '2', '70'
    
    So you could probably get your desired result with "(n XX /a)*" ?
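Applied to the two-byte, length-inclusive layout from the question, the suggestion might look like the sketch below. The sample buffer is made up for illustration, and `X2` is simply shorthand for `XX`. Each unpacked chunk still starts with its own length prefix, so the sketch strips it afterwards.

```perl
use strict;
use warnings;

# Two made-up records; each length prefix counts its own two bytes.
my $buffer = join '', map { pack('n', length($_) + 2) . $_ } 'ABC', 'DE';

# n reads the length, X2 backs up over those two bytes, and /a then takes
# that many bytes starting at the prefix, so each chunk includes the prefix.
my @chunks = unpack '(n X2 /a)*', $buffer;

# Drop the two prefix bytes from each chunk to leave only the payload.
my @payloads = map { substr $_, 2 } @chunks;

print join(',', @payloads), "\n";
```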

    --
    I'm a pessimist about probabilities; I'm an optimist about possibilities.
    Lewis Mumford

      Ok I get it. I can do unpack '(nX/a)*' but apparently more than one 'X' isn't supported. The pattern I would actually use is '(nXX/a)*' and that doesn't parse. Boo hoo! This almost looks like a thinko on the part of whoever implemented this for 5.8.0. It's a nice feature but I don't know how many people really want to be limited to packed strings of 255 chars or less.

        B Bu But it works for me? My "perl -v" insists it is 5.8.0. What version of Perl are you running? I changed the test program input and format to more closely mirror what you wanted.
          $str = "\000\005ABC\000\004DE\000\003F";
          $fmt = "(n XX /a)*";
        
        returns
        '0', '5', '65', '66', '67'
        '0', '4', '68', '69'
        '0', '3', '70'
        
        Also tried it with "(nXX/a)*" with same results. When you run the same program does it get different results or a syntax error? You said it doesn't parse? As in it doesn't split the input correctly? Could it be a big/little-endian problem?

        (shenme wants to know if the battery in his new toy has run down)

Re: Unpacking fixed length records
by runrig (Abbot) on Oct 06, 2002 at 22:34 UTC
    You can't unpack variable length records like that (you say 'fixed length' in the title, but I'd call them variable if the first two bytes tell you how long the rest of the record is). You could read in the first two bytes, and then use that to read in the rest of the record using read.

    Update: /me is corrected by jmcnamara in CB and goes to reread pack and the use of "/" in the template...though it still doesn't quite work for the OP.
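For data that is already sitting in a string rather than in a file, the same read-the-header-then-the-body idea can be sketched with an in-memory filehandle. This is only an illustration: the sample buffer is invented, and it assumes the big-endian, length-inclusive prefix described in the question.

```perl
use strict;
use warnings;

# Invented sample buffer standing in for data that is already in memory.
my $buffer = join '', map { pack('n', length($_) + 2) . $_ } 'ABC', 'DE';

# Open a read-only in-memory filehandle on the string.
open my $fh, '<', \$buffer or die "Cannot open in-memory handle: $!";
binmode $fh;

my @payloads;
while ((read($fh, my $header, 2) // 0) == 2) {
    # The stored length counts the two header bytes, so subtract them.
    my $body_len = unpack('n', $header) - 2;
    my $got = read($fh, my $payload, $body_len);
    die "Truncated record\n" unless defined $got && $got == $body_len;
    push @payloads, $payload;
}
close $fh;

print join(',', @payloads), "\n";
```

Note that this uses read rather than sysread, since sysread does not work on in-memory filehandles.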

Re: Unpacking fixed length records
by Anonymous Monk on Oct 07, 2002 at 21:57 UTC
    Point 1: This is variable data, not fixed.

    Point 2:

    How about something like:
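    # $filename is a placeholder for the input file
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    binmode $fh;
    my $header_length = 2;
    while (sysread($fh, my $data, $header_length)) {
        my $mask = 'xyz';    # placeholder for the header template, e.g. 'n'
        my $body_length = unpack $mask, $data;

        sysread($fh, $data, $body_length);
        $mask = 'MNO';       # placeholder for the body template
        my $body = unpack $mask, $data;
        ...
    }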

      I goofed and called it "Fixed length records" when what would have made more sense was Run Length Encoded (RLE). Oh well. Yes, you are correct that that approach would work, but I'm trying to understand how to make it work with an unpack format. I did recently discover that it's not the format that's bad: something is odd about my data that's breaking the unpack() call. Not that it matters, but I can't really use read() and sysread() since I've just grabbed the entire 10K string from a call to /usr/bin/uncompress. See 203336 and 203230 for more on that. I'll probably spend an hour tonight following a hexdump of the data to see where unpack is dying. As far as I know there's no way to debug unpack and see where it's falling down. It just either works or it doesn't (and it's damn annoying).

        You could separate the parts of the unpack format, making sure to manually consume each part as you go along. It's more work than should be necessary of course - unpack should tell us where it failed, but alas, it doesn't. So you could repeatedly call something like
        my @list = eval { unpack "x$skip $format", $data }; { die "$@" || last }
        making sure that $skip contains the correct value between calls.
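As a rough sketch of that stepping approach, the loop below unpacks one record per call and reports the offset if anything looks wrong. The sample data is invented, and the template assumes the length-inclusive `n` prefix from the question.

```perl
use strict;
use warnings;

# Invented sample data: two records with length-inclusive prefixes.
my $data = join '', map { pack('n', length($_) + 2) . $_ } 'ABC', 'DE';

my $skip = 0;
my @payloads;
while ($skip < length $data) {
    # Unpack exactly one record starting at byte offset $skip.
    my ($chunk) = eval { unpack "x$skip n X2 /a", $data };
    die "unpack failed at offset $skip: $@" if $@;
    die "short or malformed record at offset $skip\n"
        if !defined $chunk || length($chunk) < 2;

    push @payloads, substr $chunk, 2;   # strip the two prefix bytes
    $skip += length $chunk;             # advance past the record just consumed
}

print join(',', @payloads), "\n";
```

Because each call handles a single record, a failure points at a specific offset that can then be checked against a hexdump.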

        Makeshifts last the longest.