Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have suggestions for how to structure a script to read a binary data file?

The file in question is a results file from a finite element solver, so although it's structured, it's not as simple as a header plus a big chunk of uniform data. It's got a load of scalar variable values, then massive (multi-dimensional) arrays, and both the number of words per record and the format vary.

I'm finding it hard to identify an approach that is a good compromise between being:

  • Fairly self-documenting - e.g. using explicit variable names, which can then match the documentation for the file structure
  • Compact - no great gobs of identical repeated read() and unpack() statements
  • Fast - need to handle 20GB+ of data here.

    It feels like there should be some sort of lookup or grammar approach which can tell the script what it's got, how to unpack() it, and where to put it, but I'm not sure I know what that approach is...

    Re: Reading binary files - program structure
    by BrowserUk (Patriarch) on May 20, 2010 at 01:02 UTC
      • Fairly self-documenting - e.g. using explicit variable names, which can then match the documentation for the file structure

        For explicit naming, you'll have to use either blocks of my variables, hash keys, or use constant. For efficiency and compactness, constant names defined as indexes into an array are a good option for largish numbers of discrete values:

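        use constant {
            THIS => 0, THAT => 1, THEOTHER => 2, # ...
            TEMPL1 => 'N A10 S',
        };
        # read() fills a buffer and returns the byte count, so unpack the buffer
        read( $file, my $buf, $size ) == $size or die "Short read";
        my @discrete = unpack TEMPL1, $buf;
        print "THIS:", $discrete[ THIS ];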
      • Compact - no great gobs of identical repeated read() and unpack() statements

        For multi-dim arrays, use subroutines:

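        sub get2DArray {
            my( $x, $y, $templ, $templSize, $fh ) = @_;
            my @array;
            for my $row ( 0 .. $y - 1 ) {
                # read() fills $buf and returns the number of bytes read
                read( $fh, my $buf, $templSize * $x ) == $templSize * $x
                    or die "Short read in row $row";
                push @array, [ unpack $templ . $x, $buf ];
            }
            return \@array;
        }
        my $array2D = get2DArray( 100, 100, 'N', 4, $fh );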

        You could use nFor or Loops to write a generic multi-dim array reader, but unless you're going above 3 or 4 dims, separate subs are probably easier. Watch the iteration order; it can vary.
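
        For what it's worth, a generic reader can also be written with plain recursion rather than nFor or Loops. This is only a sketch: read_nd_array is a made-up name, and it assumes the last dimension varies fastest in the file, which may not match your solver's layout.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read an N-dimensional array,
# ASSUMING the last dimension varies fastest in the file.
sub read_nd_array {
    my ( $fh, $templ, $elem_size, @dims ) = @_;
    my $n = pop @dims;                  # length of the innermost dimension
    if ( !@dims ) {
        # Innermost level: one read() and one unpack() per row.
        my $want = $elem_size * $n;
        my $got  = read( $fh, my $buf, $want );
        die "Short read: wanted $want bytes\n"
            unless defined $got && $got == $want;
        return [ unpack "$templ$n", $buf ];
    }
    my $outer = shift @dims;            # length of the outermost dimension
    my @slices;
    for ( 1 .. $outer ) {
        push @slices, read_nd_array( $fh, $templ, $elem_size, @dims, $n );
    }
    return \@slices;
}

# Self-check on an in-memory "file" of 2 x 3 x 4 big-endian 32-bit integers.
my $data = pack 'N*', 0 .. 23;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;
my $cube = read_nd_array( $fh, 'N', 4, 2, 3, 4 );
print $cube->[1][2][3], "\n";
```

        If the file stores the first dimension fastest instead, the same structure works with the dimension list reversed and the indexes swapped on access.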

      • Fast - need to handle 20GB+ of data here.

        On my system, using a combination of :perlio on the open & binmode gives me the best reading speed. See Re^2: Perl's poor disk IO performance for details. YMMV.
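
        A minimal sketch of that combination, assuming it means naming the :perlio layer in the open mode and then calling binmode on the handle. The scratch file is only there to make the snippet runnable; whether this is actually faster on your system is something to benchmark.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Create a small scratch file so the sketch runs on its own.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} pack 'N4', 1 .. 4;
close $tmp or die "Cannot close $path: $!";

# Name the :perlio layer explicitly at open time, then binmode the handle.
open my $fh, '<:perlio', $path or die "Cannot open $path: $!";
binmode $fh;

# read() fills $buf and returns the number of bytes actually read.
my $want = 16;
my $got  = read( $fh, my $buf, $want );
die "Short read\n" unless defined $got && $got == $want;

my @words = unpack 'N4', $buf;
print "@words\n";
close $fh;
```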


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        An alternative to your first snippet is to use a hash.
        use constant TEMPL1 => 'N A10 S';
        read( $file, my $buf, $size ) == $size or die "Short read";
        my %discrete;
        @discrete{qw( this that theother )} = unpack TEMPL1, $buf;
        print "this:", $discrete{ this };

        The constants are more typo resistant, though.

          I did mention hashes along with my blocks of discrete named vars. The main thing I like about the constant method (besides the typo resistance, which is good) is the simplicity of in-order iteration. Of course you can get that by putting the hash keys into an array, but once you've done that, you're better off using the package stash rather than a lexical hash.


    Re: Reading binary files - program structure
    by almut (Canon) on May 19, 2010 at 23:49 UTC

      Although there are various parser generator modules for Perl, they are probably not the best option if speed is of paramount importance.

      Maybe you could use a somewhat simpler state machine instead, i.e. a set of variables that you toggle on and off depending on specific tokens you encounter in the file. They would then indicate which section of the file you're currently in, so you could write something like

      if ($in_section_foo) {
          if ($in_subsection_bar) {
              handle_foo_bar();
          }
          ...
      }
      ...

      where handle_foo_bar() would read the appropriate number of bytes (the multi-dimensional array) and unpack them according to the pattern that applies for foo/bar.
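
      A variation on the same idea is a dispatch table keyed on a record tag instead of nested flag tests. The sketch below invents its own format purely for illustration: the 4-byte type tag, the two record layouts, the field names and the read_exact helper are all assumptions, not anything from the solver's file.

```perl
use strict;
use warnings;

# Read exactly $len bytes from $fh or die.
sub read_exact {
    my ( $fh, $len ) = @_;
    my $got = read( $fh, my $buf, $len );
    die "Short read: wanted $len bytes\n"
        unless defined $got && $got == $len;
    return $buf;
}

# HYPOTHETICAL layout: every record starts with a 4-byte big-endian tag.
# Tag 1 = three named scalars, tag 2 = a 2 x 3 array of 32-bit integers.
my %handler = (
    1 => sub {
        my ($fh) = @_;
        my %rec;
        @rec{qw( step label flags )} = unpack 'N A10 n', read_exact( $fh, 16 );
        return \%rec;
    },
    2 => sub {
        my ($fh) = @_;
        return [ map { [ unpack 'N3', read_exact( $fh, 12 ) ] } 1 .. 2 ];
    },
);

# Build a tiny in-memory "file" holding one record of each type.
my $data = pack( 'N N A10 n', 1, 7, 'static', 3 ) . pack( 'N N6', 2, 10 .. 15 );
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;

my @records;
while ( my $got = read( $fh, my $tag_buf, 4 ) ) {
    die "Truncated record tag\n" unless $got == 4;
    my $tag     = unpack 'N', $tag_buf;
    my $handler = $handler{$tag} or die "Unknown record type $tag\n";
    push @records, $handler->($fh);
}
print "$records[0]{label} $records[1][1][2]\n";
```

      The table keeps each record's template, size and field names in one place, so it can be laid out to mirror the format documentation; the real tags and layouts would have to come from that documentation.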

      It's hard to be more specific without knowing what exact format you're talking about.

    Re: Reading binary files - program structure
    by Anonymous Monk on May 19, 2010 at 23:44 UTC
      A parser probably already exists for that format, so I would go looking for it.