Reading binary files - program structure

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have suggestions for how to structure a script to read a binary data file?

The file in question is a results file from a finite element solver, so although it's structured, it's not as simple as a header plus a big chunk of uniform data. It has a load of scalar variable values, then massive (multi-dimensional) arrays, and both the number of words per record and the format vary.

I'm finding it hard to identify an approach which is a good compromise for being:

Fairly self-documenting - e.g. using explicit variable names, which can then match the documentation for the file structure

Compact - no great gobs of identical repeated read() and unpack() statements

Fast - need to handle 20GB+ of data here.

It feels like there should be some sort of lookup or grammar approach which can tell the script what it's got, how to unpack() it, and where to put it, but I'm not sure I know what that approach is...
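One shape such a lookup could take is a layout table that names each field, gives its unpack() template, and gives its size in bytes. The following is only a sketch under assumptions: the field names, templates, and the `read_record` helper are hypothetical and are not taken from any real solver format. It reads from an in-memory file so that it runs without a results file.

```perl
use strict;
use warnings;

# Hypothetical record layout: [ field name, unpack template, size in bytes ].
# The names and templates are invented for illustration only.
my @LAYOUT = (
    [ n_nodes    => 'N',   4 ],     # 32-bit big-endian unsigned integer
    [ title      => 'A10', 10 ],    # space-padded ASCII text
    [ n_elements => 'n',   2 ],     # 16-bit big-endian unsigned integer
);

# Read one record described by a layout table and return it as a hash ref.
sub read_record {
    my ( $fh, $layout ) = @_;
    my $template = join ' ', map { $_->[1] } @$layout;
    my $size = 0;
    $size += $_->[2] for @$layout;

    my $got = read( $fh, my $buf, $size );
    die "short read: wanted $size bytes\n"
        unless defined $got && $got == $size;

    my %record;
    @record{ map { $_->[0] } @$layout } = unpack $template, $buf;
    return \%record;
}

# Demo on an in-memory file standing in for the real data.
my $bytes = pack 'N A10 n', 1200, 'beam', 640;
open my $fh, '<:raw', \$bytes or die "cannot open in-memory file: $!";
my $rec = read_record( $fh, \@LAYOUT );
close $fh;

# Walk the layout table again to print the fields in file order.
printf "%s=%s\n", $_->[0], $rec->{ $_->[0] } for @LAYOUT;
```

The table keeps the names next to the templates, so it can be checked line by line against the format documentation, and a single read() plus a single unpack() handles the whole record.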

Re: Reading binary files - program structure by BrowserUk (Patriarch) on May 20, 2010 at 01:02 UTC
Fairly self-documenting - e.g. using explicit variable names, which can then match the documentation for the file structure

For explicit naming, you'll have to use either blocks of `my` variables, hash keys, or `use constant`. For efficiency and compactness, constant names defined as indexes into an array are a good option for largish numbers of discrete values:

`use constant { THIS => 0, THAT => 1, THEOTHER => 2, ... TEMPL1 => 'N A10 S', }; read( $file, my $buf, $size ); my @discrete = unpack TEMPL1, $buf; print "THIS:", $discrete[ THIS ];`

Compact - no great gobs of identical repeated read() and unpack() statements

For multi-dim arrays, use subroutines:

`sub get2DArray { my( $x, $y, $templ, $templSize, $fh ) = @_; my @array; for ( 1 .. $y ) { read( $fh, my $buf, $templSize * $x ); push @array, [ unpack $templ . $x, $buf ]; } return \@array; } my $array2D = get2DArray( 100, 100, 'N', 4, $fh );`

You could use nFor or Loops to write a generic multi-dim array reader, but unless you're going above 3 or 4 dims, separate subs are probably easier. Watch the iteration order; it can vary.

Fast - need to handle 20GB+ of data here.

On my system, using a combination of `:perlio` on the open & binmode gives me the best reading speed. See Re^2: Perl's poor disk IO performance for details. YMMV.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy
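For anyone who wants to try the two ideas above, here is a self-contained sketch that combines the constant-indexed array with a row-by-row 2D reader. It is an illustration only: the `read_exact` and `get_2d_array` helpers and the sample data are assumptions, and the input is an in-memory file rather than a real results file. Note that read() fills a buffer and returns a byte count, so the buffer, not the return value, is what goes to unpack().

```perl
use strict;
use warnings;
use constant { THIS => 0, THAT => 1, THEOTHER => 2 };
use constant TEMPL1 => 'N A10 S';    # 4 + 10 + 2 = 16 bytes

# Read exactly $size bytes or die; returns the raw buffer.
sub read_exact {
    my ( $fh, $size ) = @_;
    my $got = read( $fh, my $buf, $size );
    die "short read\n" unless defined $got && $got == $size;
    return $buf;
}

# Read $rows rows of $cols values each: one read() and one unpack() per row.
sub get_2d_array {
    my ( $cols, $rows, $templ, $templ_size, $fh ) = @_;
    my @array;
    for ( 1 .. $rows ) {
        push @array,
            [ unpack $templ . $cols, read_exact( $fh, $templ_size * $cols ) ];
    }
    return \@array;
}

# Build a tiny in-memory file: one 16-byte header, then 2 rows of 3 values.
my $data = pack( TEMPL1, 7, 'mesh', 3 ) . pack( 'N6', 1 .. 6 );
open my $fh, '<:raw', \$data or die "cannot open in-memory file: $!";
# For a file on disk, the suggestion above would presumably look like
#   open my $fh, '<:perlio', $path or die ...; binmode $fh;
# ($path is hypothetical here).

my @discrete = unpack TEMPL1, read_exact( $fh, 16 );
my $array2d  = get_2d_array( 3, 2, 'N', 4, $fh );
close $fh;

print "THIS: ", $discrete[THIS], "\n";
print join( ' ', @$_ ), "\n" for @$array2d;
```

Reading a whole row per read() keeps the number of system-level calls down, which matters more than the unpack() cost once the file reaches many gigabytes.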
Re^2: Reading binary files - program structure by ikegami (Patriarch) on May 20, 2010 at 15:51 UTC
An alternative to your first snippet is to use a hash. `use constant TEMPL1 => 'N A10 S'; my %discrete; read( $file, my $buf, $size ); @discrete{qw( this that theother )} = unpack TEMPL1, $buf; print "this:", $discrete{ this };` The constants are more typo resistant, though.
Re^3: Reading binary files - program structure by BrowserUk (Patriarch) on May 20, 2010 at 16:01 UTC
I did mention hashes, along with blocks of `my` for discrete named vars. The main thing I like about the constant method (besides the typo resistance, which is good) is the simplicity of in-order iteration. Of course you can get that by putting the hash keys into an array, but once you've done that, you're better off using the package stash rather than a lexical hash.
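To make the ordering point concrete, here is a small sketch of the "hash keys in an array" variant. The field names and sample values are hypothetical, and the buffer is built with pack() as a stand-in for bytes read from the file.

```perl
use strict;
use warnings;

# Hypothetical field names, listed in file order.
my @FIELDS = qw( this that theother );
my $TEMPL1 = 'N A10 S';

# Stand-in for a buffer filled by read().
my $buf = pack $TEMPL1, 42, 'plate', 9;

# A hash slice assigns all unpacked values to their names in one step.
my %discrete;
@discrete{@FIELDS} = unpack $TEMPL1, $buf;

# Iterating over @FIELDS recovers the file order; "keys %discrete" would not.
my @report = map { "$_=$discrete{$_}" } @FIELDS;
print "$_\n" for @report;
```

With constant indexes into a plain array, the same in-order walk is just a loop over the array, which is the simplicity being described.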
Re: Reading binary files - program structure by Anonymous Monk on May 24, 2010 at 17:16 UTC
Re^2: Reading binary files - program structure by BrowserUk (Patriarch) on May 24, 2010 at 20:38 UTC
Re: Reading binary files - program structure by almut (Canon) on May 19, 2010 at 23:49 UTC
Although there are various parser generator modules for Perl, they are probably not the best option if speed is of paramount importance. Maybe you could use a somewhat simpler state machine instead, i.e. a set of variables that you toggle on and off depending on specific tokens you encounter in the file. They would then indicate what section of the file you're currently in, so you could write something like `if ($in_section_foo) { if ($in_subsection_bar) { handle_foo_bar(); } ... } ...` where `handle_foo_bar()` would read the appropriate number of bytes (the multi-dimensional array) and unpack them according to the pattern that applies for foo/bar. It's hard to be more specific without knowing what exact format you're talking about.
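A minimal sketch of that idea follows. It swaps the on/off flag variables for a dispatch hash keyed on the current section, which is a variation on the suggestion rather than the same thing. Everything about the format here is an assumption made up for the example: the 4-character section tags, the 32-bit count that follows each tag, and the `NODE` and `DISP` section names.

```perl
use strict;
use warnings;

# Read exactly $size bytes or die; returns the raw buffer.
sub read_exact {
    my ( $fh, $size ) = @_;
    my $got = read( $fh, my $buf, $size );
    die "short read\n" unless defined $got && $got == $size;
    return $buf;
}

# One handler per section type. Each reads its own payload and unpacks it
# with the pattern that applies to that section (hypothetical patterns).
my %HANDLER = (
    NODE => sub {
        my ( $fh, $n ) = @_;
        return [ unpack "N$n", read_exact( $fh, 4 * $n ) ];
    },
    DISP => sub {
        my ( $fh, $n ) = @_;
        return [ unpack "n$n", read_exact( $fh, 2 * $n ) ];
    },
);

# The section tag acts as the current state and selects the handler.
sub parse {
    my ($fh) = @_;
    my %sections;
    until ( eof $fh ) {
        my ( $tag, $count ) = unpack 'A4 N', read_exact( $fh, 8 );
        my $handler = $HANDLER{$tag} or die "unknown section '$tag'\n";
        $sections{$tag} = $handler->( $fh, $count );
    }
    return \%sections;
}

# In-memory stand-in for the file: two tagged sections.
my $data = pack( 'A4 N N3', 'NODE', 3, 10, 20, 30 )
         . pack( 'A4 N n2', 'DISP', 2, 5, 6 );
open my $fh, '<:raw', \$data or die "cannot open in-memory file: $!";
my $sections = parse($fh);
close $fh;

print "$_: @{ $sections->{$_} }\n" for sort keys %$sections;
```

Adding a new section type then means adding one entry to the table instead of another level of nested `if` blocks.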
Re: Reading binary files - program structure by Anonymous Monk on May 19, 2010 at 23:44 UTC
A parser probably already exists for that format, so I would go looking for it.