dwalin has asked for the wisdom of the Perl Monks concerning the following question:

hail oh reverend gurus, a lowly initiate seeks advice at your everlasting fountain of knowledge. :)

there's a task of reading binary data from a file, parsing it and printing it out to a csv text file. there are two known formats of the binary file, both documented. in the future there may be more formats, so i would like to keep the parsing code as generalized as possible. the holy unpack is good but there's a catch (as always). the data record format is about that:
the catch itself is that all the data is little-endian but the platform used to run the software producing this binary data is big-endian (sparc64). the conversion script should run on the sparc.

so there's my humble question: is there any way to do the parsing elegantly? i did the parsing quick and dirty way but i know that's a sin. my view of heaven for this case is the holy unpack to be able to parse the whole record in one pass. i tried to meditate on perlpacktut for three days but nothing comes to my mind short of divine intervention in form of some signed "v" template and that doesn't solve bit-string matter in which every bit is a separate field...

oh, and another complication: the system that produces the binary data is proprietary and it's using solaris 9 as a platform so it would be best to avoid any modules not built into perl 5.6.1 that is already there. no cc. :( ideally it should be a small beautiful self-contained perl hack but... i feel i have reached the limit of my humble perl knowledge so i ask your advice, oh gurus!

Replies are listed 'Best First'.
Re: yet another "reading binary data" question
by pc88mxer (Vicar) on May 07, 2008 at 16:42 UTC
    Are the quantities x, y, z and n known beforehand? You can do most of this with a single unpack call. Some of the values might need to be fixed up after it's parsed.

    Here's probably how I would go about it:

    my ($x, $y, $z, $n, $len) = ...   # fill in these parameters
    my $format  = "V$x n$y n C$z " . ("Z$len" x $n);
    my $recsize = 4*$x + 2*$y + 2 + $z + ($n*$len);

    open B, '<', 'binary_file.bin' or die ...;
    while (read(B, $buf, $recsize) == $recsize) {
        my @values = unpack($format, $buf);
        my @x = splice(@values, 0, $x);    # snip off the 4-byte ints
        my @y = splice(@values, 0, $y);    # the 2-byte ints
        my $bits = splice(@values, 0, 1);
        my @z = splice(@values, 0, $z);    # the 1-byte ints
        # @values now contain just the null-terminated strings

        # manually fix up signed vs. unsigned shorts in @y:
        $y[1] -= 65536 if ($y[1] > 32767);
        # repeat for each signed value

        ...process the record...
    }
    Admittedly, fixing up the shorts is a kludge. Unfortunately, there doesn't seem to be a format for big-endian signed values, so that's why I opted for this approach. I like to use the platform-independent formats n and v for shorts as opposed to the platform-dependent formats s and S to make the code, well, platform-independent.
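The fix-up described above can be tried in isolation. The following is only a minimal sketch with made-up sample bytes: since the question says the data is little-endian, it reads the shorts with `v` rather than `n`, and it applies the subtraction to every value instead of to hand-picked indices.

```perl
use strict;
use warnings;

# Hypothetical sample: three little-endian 16-bit values.
my $buf = "\x01\x00" . "\xff\xff" . "\x76\x98";

# 'v' always reads unsigned little-endian shorts, whatever the host byte order.
my @shorts = unpack('v3', $buf);

# Reinterpret values with the top bit set as negative (two's complement).
my @signed = map { $_ > 32767 ? $_ - 65536 : $_ } @shorts;

print "@signed\n";    # 1 -1 -26506
```

In a real record only the fields documented as signed would go through the `map`; the others stay as `unpack` returned them.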

    Update: Fixed format based on alexm's comment.

      my $format = "v$x n$y n C$z ".("Z$len" x $n);
      my $recsize = 2*$x + 2*$y + 2 + $z + ($n*$len);
      Shouldn't it be $format = "V$x ..." and $recsize = 4*$x + ... ?
        Yeah, you're right - got too preoccupied with the shorts and forgot about the longs.
      yes, indeed the record format is known beforehand. it is fixed and all field quantities and overall record length is known. in fact, already i have something resembling your code however slightly more generalized. but it doesn't really matter if i can't use one unpack for them all, one kludge or two is still kludgy code. i'm not after effectiveness this time, more like beauty. :)
      thanks for the reply, anyway.
Re: yet another "reading binary data" question
by ikegami (Patriarch) on May 07, 2008 at 23:29 UTC

    Maybe not in one pass, since there's no template for unpacking a 16-bit signed little-endian int.

    32-bit unsigned little-endian int: unpack('V', $_)
    32-bit signed little-endian int: unpack('l', pack('L', unpack('V', $_)))
    16-bit unsigned little-endian int: unpack('v', $_)
    16-bit signed little-endian int: unpack('s', pack('S', unpack('v', $_)))

    However, it's easy to have unpack do most of the work, then just touch up the results elegantly.

    $_ = "\x32\x54\x76\x98"   # 2557891634 as a 32-bit unsigned LE int
       . "\x32\x54\x76\x98"   # -1737075662 as a 32-bit signed LE int
       . "\x76\x98"           # 39030 as a 16-bit unsigned LE int
       . "\x76\x98";          # -26506 as a 16-bit signed LE int

    my @nums = unpack('VVvv', $_);
    $_ = unpack('l', pack('L', $_)) for @nums[1];  # Fix signs of longs.
    $_ = unpack('s', pack('S', $_)) for @nums[3];  # Fix signs of shorts.

    As for the bits, you didn't specify what you wanted to do with them. You could leave them grouped and use masks, or you could separate them. The former is trivial (unpack('v', $_)). The latter is a bit more complicated, but not that much:

    $_ = "\x34\x01";
    my @nums = unpack('v', $_);   # 'v', not 's', so the result is the same on any host
    my @flags = ( split //, unpack('B16', pack('n', $nums[0])) )[16-9 .. 16-1];

    Tested on a little-endian system, but it should work equally well on a big-endian system.

      thanks a lot, i never thought of converting unsigned to signed using pack/unpack. :) as for the bits, i need them separated.
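For separated bits, one possible shortcut is the `b` template, which returns a bit string in ascending order within each byte. This is only a sketch using the same two sample bytes as the example above; whether lowest-bit-first matches the documented field order is an assumption to check against the record spec.

```perl
use strict;
use warnings;

my $buf = "\x34\x01";    # 9 flag bits stored in two bytes

# 'b9' yields the first nine bits, least significant bit of each byte first.
my @flags = split //, unpack('b9', $buf);

print "@flags\n";        # 0 0 1 0 1 1 0 0 1
```

Note that this is the reverse of the order produced by the `B16` slice above, which lists the most significant of the nine bits first.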
Re: yet another "reading binary data" question
by TGI (Parson) on May 07, 2008 at 18:15 UTC

    If you can get it onto your system, Convert::Binary::C ought to handle the task nicely.


    TGI says moo

      maybe it does but i have no cc on that machine and installing gcc is not an option. that would mean i have to reinstall perl (built with gcc instead of sun cc) and half a system as well with it. the software that sits on top of that solaris box is kludgy enough not to want messing with it.

      honestly i hoped someone would point out something i missed in the unpack manual...

        I do most of my work on Win32 systems, so I know all too well about kludgy systems that lack c-compilers. The situation sucks badly. That's why I said, "if you can get it installed."

        IIRC, ActiveState has a Solaris version of ActivePerl. Perhaps you could use that as a way to get a modern perl with binary module availability. You may even want to consider using Perlapp to make app bundles on another system.

        For that matter, can you build your code on another solaris box and use PAR::Packager to make an executable bundle? If you have another suitable system, this approach seems rather appealing.

        Sadly, I know of no deep pack/unpack magic to help you.


        TGI says moo

Re: yet another "reading binary data" question
by dwalin (Monk) on May 07, 2008 at 20:10 UTC
    it seems that heavens have listened to my prayers! at least half of them, anyway... in perl 5.10.0 there's a new modifier for pack/unpack that allows one to force big- or little-endian explicitly. means that i could solve signed short part of the problem. no bright idea what to do with bit strings, though. and that's 5.10.0... *sigh*
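For readers who do have perl 5.10 or later, the modifiers mentioned here are `<` (little-endian) and `>` (big-endian), appended to a native-size template letter. A minimal sketch with made-up sample bytes:

```perl
use strict;
use warnings;
use 5.010;    # the '<' and '>' modifiers are not available before 5.10

# Sample bytes: one 32-bit and one 16-bit little-endian value, both negative.
my $buf = "\x32\x54\x76\x98" . "\x76\x98";

# 'l<' and 's<' read signed little-endian values regardless of host byte order.
my ($long, $short) = unpack('l< s<', $buf);

print "$long $short\n";    # -1737075662 -26506
```

This does not help on the 5.6.1 box from the question, where the pack/unpack round trip or the subtraction fix-up is still needed.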

      I did recall correctly (in my response above), you can get ActivePerl 5.10 and 5.8 for Solaris. Whether it will work with the version you've got, and how invasive the install is, I don't know. But it may be worth checking out.


      TGI says moo

Re: yet another "reading binary data" question
by apl (Monsignor) on May 07, 2008 at 16:39 UTC
    Why not write your own version of unpack which would support (in the template) 9 bit fields and little-endian integers?

    I realize this is a non-trivial task, but once written it's trivial to modify as the record format changes. (That is, your version of unpack would not change, only the calls to it.)
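A full reimplementation of unpack is a big job, but a thin wrapper may be enough. The sketch below is hypothetical throughout: `my_unpack`, `%EXT` and the codes `sLE` and `lLE` are invented names, and it assumes every code yields exactly one value (no repeat counts).

```perl
use strict;
use warnings;

# Invented extension codes, each mapped to a real template letter plus a fix-up.
my %EXT = (
    sLE => ['v', sub { $_[0] > 0x7FFF     ? $_[0] - 65536      : $_[0] }],
    lLE => ['V', sub { $_[0] > 0x7FFFFFFF ? $_[0] - 4294967296 : $_[0] }],
);

sub my_unpack {
    my ($codes, $buf) = @_;    # $codes: array ref of type codes

    # Translate extension codes to real ones and unpack in a single call.
    my @real = map { $EXT{$_} ? $EXT{$_}[0] : $_ } @$codes;
    my @vals = unpack(join(' ', @real), $buf);

    # Apply the fix-up only to fields that used an extension code.
    for my $i (0 .. $#$codes) {
        my $ext = $EXT{ $codes->[$i] } or next;
        $vals[$i] = $ext->[1]->($vals[$i]);
    }
    return @vals;
}

my @v = my_unpack(['V', 'lLE', 'v', 'sLE'],
                  "\x32\x54\x76\x98" x 2 . "\x76\x98" x 2);
print "@v\n";    # 2557891634 -1737075662 39030 -26506
```

When the record format changes, only the list of codes passed to the wrapper changes, which is the property suggested above.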

      truth to say i haven't the first idea how to approach that. i mean, i'm not really a perl hacker, i just use it now and then to solve my little tasks. :) i'm not totally non-programmer kind but write something as magic as unpack is definitely beyond my scope. or at least it would take so much time and effort that it's not worth it.
      thanks for the advice, though. :)
Re: yet another "reading binary data" question
by GrandFather (Saint) on May 07, 2008 at 21:30 UTC

    We can't do much to polish code we can't see. Perhaps you need to show us a sample of what you have got?


    Perl is environmentally friendly - it saves trees
      ok, it's kludgy and non-perlish but here it goes:
      #
      # ECHI R11 (standard).
      #
      $ECHI_R11_LEN = 323;
      @ECHI_R11_FMT = (
        ['CALLID', 'V', 0, 4], ['ACWTIME', 'V', 4, 4], ['ANSHOLDTIME', 'V', 8, 4],
        ['CONSULTTIME', 'V', 12, 4], ['DISPTIME', 'V', 16, 4], ['DURATION', 'V', 20, 4],
        ['SEGSTART', 'V', 24, 4], ['SEGSTOP', 'V', 28, 4], ['TALKTIME', 'V', 32, 4],
        ['NETINTIME', 'V', 36, 4], ['ORIGHOLDTIME', 'V', 40, 4],
        ['DISPIVECTOR', 'v', 44, 2], ['DISPSPLIT', 'v', 46, 2], ['FIRSTVECTOR', 'v', 48, 2],
        ['SPLIT1', 'v', 50, 2], ['SPLIT2', 'v', 52, 2], ['SPLIT3', 'v', 54, 2],
        ['TKGRP', 'v', 56, 2], ['EQ_LOCID', 'v', 58, 2], ['ORIG_LOCID', 'v', 60, 2],
        ['ANS_LOCID', 'v', 62, 2], ['OBS_LOCID', 'v', 64, 2],
        ['ASSIST', 'b', 66, 1, 0], ['AUDIO', 'b', 66, 1, 1], ['CONFERENCE', 'b', 66, 1, 2],
        ['DA_QUEUED', 'b', 66, 1, 3], ['HOLDABN', 'b', 66, 1, 4], ['MALICIOUS', 'b', 66, 1, 5],
        ['OBSERVINGCALL', 'b', 66, 1, 6], ['TRANSFERRED', 'b', 66, 1, 7],
        ['AGT_RELEASED', 'b', 67, 1, 0],
        ['ACD', 'C', 68, 1], ['DISPOSITION', 'C', 69, 1], ['DISPPRIORITY', 'C', 70, 1],
        ['HELD', 'C', 71, 1], ['SEGMENT', 'C', 72, 1], ['ANSREASON', 'C', 73, 1],
        ['ORIGREASON', 'C', 74, 1], ['DISPSKLEVEL', 'C', 75, 1],
        ['EVENT1', 'C', 76, 1], ['EVENT2', 'C', 77, 1], ['EVENT3', 'C', 78, 1],
        ['EVENT4', 'C', 79, 1], ['EVENT5', 'C', 80, 1], ['EVENT6', 'C', 81, 1],
        ['EVENT7', 'C', 82, 1], ['EVENT8', 'C', 83, 1], ['EVENT9', 'C', 84, 1],
        ['UCID', 'Z*', 85, 21], ['DISPVDN', 'Z*', 106, 8], ['EQLOC', 'Z*', 114, 10],
        ['FIRSTVDN', 'Z*', 124, 8], ['ORIGLOGIN', 'Z*', 132, 10], ['ANSLOGIN', 'Z*', 142, 10],
        ['LASTOBSERVER', 'Z*', 152, 10], ['DIALED_NUM', 'Z*', 162, 25],
        ['CALLING_PTY', 'Z*', 187, 13], ['LASTDIGITS', 'Z*', 200, 17],
        ['LASTCWC', 'Z*', 217, 17], ['CALLING_II', 'Z*', 234, 3],
        ['CWC1', 'Z*', 237, 17], ['CWC2', 'Z*', 254, 17], ['CWC3', 'Z*', 271, 17],
        ['CWC4', 'Z*', 288, 17], ['CWC5', 'Z*', 305, 17]
      );

      #
      # ECHI R12 (expanded).
      #
      $ECHI_R12_LEN = 493;
      @ECHI_R12_FMT = (
        ['CALLID', 'V', 0, 4], ['ACWTIME', 'V', 4, 4], ['ANSHOLDTIME', 'V', 8, 4],
        ['CONSULTTIME', 'V', 12, 4], ['DISPTIME', 'V', 16, 4], ['DURATION', 'V', 20, 4],
        ['SEGSTART', 'V', 24, 4], ['SEGSTOP', 'V', 28, 4], ['TALKTIME', 'V', 32, 4],
        ['NETINTIME', 'V', 36, 4], ['ORIGHOLDTIME', 'V', 40, 4],
        #
        # Extended ECHI R12 fields
        #
        ['QUEUETIME', 'V', 44, 4], ['RINGTIME', 'V', 48, 4],
        #
        # End of Extended ECHI R12 fields
        #
        ['DISPIVECTOR', 'v', 52, 2], ['DISPSPLIT', 'v', 54, 2], ['FIRSTVECTOR', 'v', 56, 2],
        ['SPLIT1', 'v', 58, 2], ['SPLIT2', 'v', 60, 2], ['SPLIT3', 'v', 62, 2],
        ['TKGRP', 'v', 64, 2], ['EQ_LOCID', 'v', 66, 2], ['ORIG_LOCID', 'v', 68, 2],
        ['ANS_LOCID', 'v', 70, 2], ['OBS_LOCID', 'v', 72, 2],
        #
        # Extended ECHI R12 field
        #
        ['UUI_LEN', 'v', 74, 2],
        #
        # End of Extended ECHI R12 field
        #
        ['ASSIST', 'b', 76, 1, 0], ['AUDIO', 'b', 76, 1, 1], ['CONFERENCE', 'b', 76, 1, 2],
        ['DA_QUEUED', 'b', 76, 1, 3], ['HOLDABN', 'b', 76, 1, 4], ['MALICIOUS', 'b', 76, 1, 5],
        ['OBSERVINGCALL', 'b', 76, 1, 6], ['TRANSFERRED', 'b', 76, 1, 7],
        ['AGT_RELEASED', 'b', 77, 1, 0],
        ['ACD', 'C', 78, 1], ['DISPOSITION', 'C', 79, 1], ['DISPPRIORITY', 'C', 80, 1],
        ['HELD', 'C', 81, 1], ['SEGMENT', 'C', 82, 1], ['ANSREASON', 'C', 83, 1],
        ['ORIGREASON', 'C', 84, 1], ['DISPSKLEVEL', 'C', 85, 1],
        ['EVENT1', 'C', 86, 1], ['EVENT2', 'C', 87, 1], ['EVENT3', 'C', 88, 1],
        ['EVENT4', 'C', 89, 1], ['EVENT5', 'C', 90, 1], ['EVENT6', 'C', 91, 1],
        ['EVENT7', 'C', 92, 1], ['EVENT8', 'C', 93, 1], ['EVENT9', 'C', 94, 1],
        ['UCID', 'Z*', 95, 21], ['DISPVDN', 'Z*', 116, 8], ['EQLOC', 'Z*', 124, 10],
        ['FIRSTVDN', 'Z*', 134, 8], ['ORIGLOGIN', 'Z*', 142, 10], ['ANSLOGIN', 'Z*', 152, 10],
        ['LASTOBSERVER', 'Z*', 162, 10], ['DIALED_NUM', 'Z*', 172, 25],
        ['CALLING_PTY', 'Z*', 197, 13], ['LASTDIGITS', 'Z*', 210, 17],
        ['LASTCWC', 'Z*', 227, 17], ['CALLING_II', 'Z*', 244, 3],
        ['CWC1', 'Z*', 247, 17], ['CWC2', 'Z*', 264, 17], ['CWC3', 'Z*', 281, 17],
        ['CWC4', 'Z*', 298, 17], ['CWC5', 'Z*', 315, 17],
        #
        # Extended ECHI R12 fields
        #
        ['VDN2', 'Z*', 332, 8], ['VDN3', 'Z*', 340, 8], ['VDN4', 'Z*', 348, 8],
        ['VDN5', 'Z*', 356, 8], ['VDN6', 'Z*', 364, 8], ['VDN7', 'Z*', 372, 8],
        ['VDN8', 'Z*', 380, 8], ['VDN9', 'Z*', 388, 8], ['ASAI_UUI', 'Z*', 396, 96]
      );

      [...skipped...]

      read(INFILE, $buf, 4);
      $version = unpack("V", $buf);
      read(INFILE, $buf, 4);
      $sequence = unpack("V", $buf);

      if ($version == 12 || $version == 11) {
        print localtime() . " Processing file $ARGV[0], version $version, sequence $sequence\n";
      }
      else {
        die localtime() . " Unsupported file version $version, can't process!\n"
      };

      undef $buf;

      my $rec_len  = $version == 12 ? $ECHI_R12_LEN : $ECHI_R11_LEN;
      my $echi_fmt = $version == 12 ? \@ECHI_R12_FMT : \@ECHI_R11_FMT;

      my $processed = 0;

      print_header(OUTFILE, $echi_fmt) unless !$PRINT_HEADER;

      while(read(INFILE, $buf, $rec_len)) {
        my %record = unpack_record($buf, $echi_fmt);
        print localtime() . " Processing record $processed, Call ID " . $record{'CALLID'} .
          ", Segment " . $record{'SEGMENT'} . "\n";
        print_record(OUTFILE, \%record, $echi_fmt);
        $processed++;
      };

      [...skipped...]

      sub unpack_record {
        my $buf = shift;
        my $echi_format = shift;
        my $echi_rec = undef;
        my %record;
        my $str = undef;
        my $val = undef;

        foreach $echi_field (@$echi_format) {
          my $echi_name = $echi_field->[0];
          my $echi_type = $echi_field->[1];
          my $echi_offset = $echi_field->[2];
          my $echi_len = $echi_field->[3];
          my $echi_bitoffset = $echi_field->[4];
          my $str = undef;

          $str = substr($buf, $echi_offset, $echi_len);

          if ($echi_type ne 'b') {
            $val = unpack($echi_type, $str);
          }
          else {
            # one bit per flag; vec() counts bits from the low-order end of the byte
            $val = vec($str, $echi_bitoffset, 1);
          };

          if ($echi_name =~ /SEG(START|STOP)/) {
            $val = strftime($DATE_FORMAT, localtime($val)) unless !$DATE_FORMAT;
          };

          # the split numbers are signed shorts
          if ($echi_name =~ /SPLIT[1-3]/) {
            $val -= 65536 if ($val > 32767);
          };

          if ($echi_type =~ /Z/) {
            $record{$echi_name} = "\"" . $val . "\"";
          }
          else {
            $record{$echi_name} = $val;
          };
        };

        return %record;
      };

      sub print_header {
        my $file = shift;
        my $echi_format = shift;
        my $line = "";

        foreach $echi_field (@$echi_format) {
          $line .= $echi_field->[0] . ",";
        };

        $line =~ s/,$//;
        print $file "$line\n";
      };

      sub print_record {
        my $file = shift;
        my $record = shift;
        my $echi_format = shift;
        my $line = "";

        foreach $echi_field (@$echi_format) {
          $name = $echi_field->[0];
          $line .= $record->{$name} . ",";
        };

        $line =~ s/,$//;
        print $file "$line\n";
      };

      *blush* i guess my old pascal habits stick out... but at least it works. :) now, i was always told to optimize late, so now that i've got a working script i would like to optimize the hell out of it. that's the real fun. :)
      Does anyone have Perl code for the ECHI CMS 16.2 release?
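One direction for that optimization is to let the field table generate a single unpack template, using `@N` to jump to each absolute offset, so the per-field substr/unpack loop goes away. The sketch below is only an illustration: the four-entry table and the sample buffer are made up, it uses `Z` with an explicit length for strings, and it leaves out the bit fields and the signed fix-up.

```perl
use strict;
use warnings;

# A shortened, hypothetical table in the [name, type, offset, length] layout.
my @fmt = (
    ['CALLID', 'V', 0, 4],
    ['SPLIT1', 'v', 4, 2],
    ['ACD',    'C', 6, 1],
    ['UCID',   'Z', 7, 5],
);

# '@N' moves to absolute offset N; only string types need an explicit length.
my $template = join ' ', map {
    my ($name, $type, $off, $len) = @$_;
    "\@$off $type" . ($type eq 'Z' ? $len : '');
} @fmt;

# Build a matching sample record and unpack it in one call.
my $buf = pack('V v C Z5', 42, 65535, 7, 'abc');
my %rec;
@rec{ map { $_->[0] } @fmt } = unpack($template, $buf);

print "$template\n";    # @0 V @4 v @6 C @7 Z5
print join(',', map { $rec{ $_->[0] } } @fmt), "\n";    # 42,65535,7,abc
```

Because the template is derived from the table, adding a new record format only means adding a new table.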
Re: yet another "reading binary data" question
by dwalin (Monk) on May 12, 2008 at 06:45 UTC
    okay, so after a bit of optimiziren™ now it looks like that:
    %ECHI = (
      [...skipped...]
      12 => # CMS R12 and above
      {
        length => 493,
        header => 'CALLID,ACWTIME,ANSHOLDTIME,CONSULTTIME,DISPTIME,DURATION,'
                . 'SEGSTART,SEGSTOP,TALKTIME,NETINTIME,ORIGHOLDTIME,QUEUETIME,RINGTIME,'
                . 'DISPIVECTOR,DISPSPLIT,FIRSTVECTOR,SPLIT1,SPLIT2,SPLIT3,TKGRP,'
                . 'EQ_LOCID,ORIG_LOCID,ANS_LOCID,OBS_LOCID,UUI_LEN,'
                . 'ASSIST,AUDIO,CONFERENCE,DA_QUEUED,HOLDABN,MALICIOUS,OBSERVINGCALL,'
                . 'TRANSFERRED,AGT_RELEASED,ACD,DISPOSITION,DISPPRIORITY,HELD,SEGMENT,'
                . 'ANSREASON,ORIGREASON,DISPSKLEVEL,'
                . 'EVENT1,EVENT2,EVENT3,EVENT4,EVENT5,EVENT6,EVENT7,EVENT8,EVENT9,'
                . 'UCID,DISPVDN,EQLOC,FIRSTVDN,ORIGLOGIN,ANSLOGIN,LASTOBSERVER,'
                . 'DIALED_NUM,CALLING_PTY,LASTDIGITS,LASTCWC,CALLING_II,'
                . 'CWC1,CWC2,CWC3,CWC4,CWC5,'
                . 'VDN2,VDN3,VDN4,VDN5,VDN6,VDN7,VDN8,VDN9,ASAI_UUI',
        format => 'V13 v12 x2 C17 A21 A8 A10 A8' . 'A10'x3 . 'A25 A13 A17 A17 A3'
                . 'A17'x5 . 'A8'x8 . 'A96',
        bits => {index => 25, format => '@76b9'},
        signed => [14, 16, 17, 18],
        segment => 38,
        strstart => 51
      }
    );

    [...skipped...]

    read(INFILE, $buf, 8);
    ($ver, $seq) = unpack("V2", $buf);

    # exists, not grep: "grep $ver, keys %ECHI" is true for any non-zero $ver
    die localtime() . " Unsupported file version $ver, can't process"
      unless exists $ECHI{$ver};

    print localtime() . " Processing file $ARGV[0], version $ver, sequence $seq";
    print OUTFILE $ECHI{$ver}{header} if $PRINT_HEADER;

    my $processed = 0;

    while(read(INFILE, $buf, $ECHI{$ver}{length})) {
      my @data = unpack($ECHI{$ver}{format}, $buf);

      # insert the nine flag bits as separate fields
      splice @data, $ECHI{$ver}{bits}{index}, 0,
        split(//, unpack($ECHI{$ver}{bits}{format}, $buf));

      if ($DATE_FORMAT) {
        for (my $i = 6; $i < 8; $i++) {
          $data[$i] = strftime($DATE_FORMAT, localtime($data[$i]));
        }
      };

      # reinterpret the signed shorts
      foreach my $index (@{$ECHI{$ver}{signed}}) {
        $data[$index] = unpack('s', pack('S', $data[$index]));
      };

      # quote the string fields
      for (my $i = $ECHI{$ver}{strstart}; $i <= $#data; $i++) {
        $data[$i] = '"' . $data[$i] . '"';
      };

      die localtime() . " Cannot write to file: $!"
        unless print OUTFILE join ',', @data;

      $processed++;

      print localtime() . " Processed record $processed, Call ID " . $data[0] .
        ", Segment " . $data[$ECHI{$ver}{segment}];
    };
    i don't see here anything left to optimize. any thoughts?
      This is really cool for ECH