PerlMonks  

Best way to a parse a big binary file

by Dirk80 (Pilgrim)
on Nov 29, 2019 at 21:12 UTC

Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:

I want to parse a big binary file which is grouped into sections (Main Header, Extension Header, PartOne, PartTwo, ...). At the beginning of each section there is a field which gives the length of that section. In the end I want to read the content of the binary file into objects of corresponding classes, i.e. the main header section of the binary file shall be stored in a blessed hash within package My::MainHeader, the extension header section in a blessed hash within package My::ExtensionHeader, and so on.

I'm unsure what is the best way to do it. My suggestions:

  • Reading the whole file into memory (scalar variable), and then passing a reference to the scalar variable, an offset and the number of bytes to the constructor of each class (My::MainHeader, ...).
  • Opening the file, then passing the file handle to the constructor of the My::MainHeader class. After the constructor of this class has finished reading its data from the file, the same file handle is passed to the constructor of My::ExtensionHeader, and so on.

What would you recommend and why?

Replies are listed 'Best First'.
Re: Best way to a parse a big binary file
by haukex (Archbishop) on Nov 29, 2019 at 21:40 UTC

    As always TIMTOWTDI, but personally I'd implement it as if I was reading a stream, i.e. with read. Either you give each class the ability to parse the binary representation of itself, or you implement the reading as a single function that just constructs objects of the corresponding class (this would make it easier if the sections aren't entirely independent of each other, like if you've got a checksum over multiple sections). I've chosen the latter here:

    use warnings;
    use 5.014; # for "package BLOCK" syntax (available since Perl 5.14)

    package MainHeader {
        use Moo;
        has foo => ( is => 'ro' );
    }
    package ExtHeader {
        use Moo;
        has bar => ( is => 'ro' );
    }
    package Section {
        use Moo;
        has type => ( is => 'ro' );
        has quz  => ( is => 'ro' );
    }

    my $binfile = "\x01\x06Hello\0"
        ."\x02\x04\0\x12\x34\x56"
        ."\x13\x02\xBE\xEF";

    sub binread { # helper function for read+unpack
        my ($fh, $bytes, $templ) = @_;
        read($fh, my $data, $bytes) == $bytes
            or die "failed to read $bytes bytes";
        return unpack($templ, $data);
    }

    open my $fh, '<:raw', \$binfile or die $!;
    my @packets;
    while (!eof($fh)) {
        my ($type, $length) = binread($fh, 2, "CC");
        if ($type == 0x01) {
            my ($foo) = binread($fh, $length, 'Z*');
            my $hdr = MainHeader->new( foo => $foo );
            push @packets, $hdr;
        }
        elsif ($type == 0x02) {
            my ($bar) = binread($fh, $length, 'N');
            my $exthdr = ExtHeader->new( bar => $bar );
            push @packets, $exthdr;
        }
        elsif ($type == 0x13) {
            my ($quz) = binread($fh, $length, 'n');
            my $sect = Section->new( type => $type, quz => $quz );
            push @packets, $sect;
        }
        else { die "Unknown packet type $type" }
    }
    close $fh;

    use Data::Dump;
    dd @packets;

    __END__

    (
      bless({ foo => "Hello" }, "MainHeader"),
      bless({ bar => 1193046 }, "ExtHeader"),
      bless({ quz => 48879, type => 19 }, "Section"),
    )

      Thank you very much for your example. I was thinking a lot about this task. The headers and sections are big. Each part contains a lot of fields of different types, e.g. single bytes, integers, floats, doubles, some strings, ... . Because of this I would prefer to have the parse logic in each class.

      If I understand it right, I would open the file in the main package and then pass the lexical file handle to the constructor of the first class, which parses e.g. the MainHeader. Then I would pass the file handle on to the ExtHeader class, and so on? Or is it a bad style to pass a file handle to a constructor?

      I have no checksum issue. But it would be interesting to know what you would recommend if I had to compute a checksum over all sections while still keeping the parse logic in each class instead of one central place.

      Thanks again in advance for your suggestions. I (nearly) always find a way to do my things with perl. But I want to learn how to solve my tasks with a better design and in a better way. That's why I'm asking.

        Or is it a bad style to pass a file handle to a constructor?

        No, it's fine, as long as you're using lexical filehandles (open my $fh ...). It only gets difficult if any code that is reading from the filehandle either needs to look back at something that was already read from the file, or needs to look ahead further into the file into a section that is supposed to be parsed by another piece of code - in cases like that, it's usually more appropriate to use an approach similar to what I showed above.

        Because of this I would prefer to have the parse logic in each class.

        Sure, that's fine too. Here's one quick example*:
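
The code that followed this sentence is missing from this copy of the thread. What follows is a minimal sketch of the approach, not the original example. It assumes the same 1-byte type / 1-byte length framing as the earlier code, borrows the names Packet, parse and parsefile from the surrounding discussion, and uses plain blessed hashes instead of Moo so that it runs without extra modules. The parse_body method and the %TYPES table are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch: Packet, parse and parsefile are names mentioned in the
# thread; their bodies, parse_body and %TYPES are assumptions.

package Packet;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Read exactly $bytes bytes from $fh and unpack them with $templ.
sub binread {
    my ($class, $fh, $bytes, $templ) = @_;
    my $got = read($fh, my $data, $bytes);
    die "failed to read $bytes bytes" unless defined $got && $got == $bytes;
    return unpack($templ, $data);
}

my %TYPES = ( 0x01 => 'MainHeader', 0x02 => 'ExtHeader' );

# Factory method: read the type/length prefix, then let the matching
# subclass parse its own body from the same filehandle.
sub parse {
    my ($class, $fh) = @_;
    my ($type, $length) = $class->binread($fh, 2, 'CC');
    my $subclass = $TYPES{$type} or die "Unknown packet type $type";
    return $subclass->parse_body($fh, $length);
}

package MainHeader;
use parent -norequire, 'Packet';

sub parse_body {
    my ($class, $fh, $length) = @_;
    my ($foo) = $class->binread($fh, $length, 'Z*');
    return $class->new( foo => $foo );
}

package ExtHeader;
use parent -norequire, 'Packet';

sub parse_body {
    my ($class, $fh, $length) = @_;
    my ($bar) = $class->binread($fh, $length, 'N');
    return $class->new( bar => $bar );
}

package main;

# Parse packets until the filehandle is exhausted.
sub parsefile {
    my ($fh) = @_;
    my @packets;
    push @packets, Packet->parse($fh) until eof($fh);
    return @packets;
}

my $binfile = "\x01\x06Hello\0" . "\x02\x04\0\x12\x34\x56";
open my $fh, '<:raw', \$binfile or die $!;
my @packets = parsefile($fh);
close $fh;
print ref($_), "\n" for @packets;
```

Each class only knows how to decode its own body, while the framing (type and length) stays in one place, so adding a new section type means adding one subclass and one entry in the lookup table.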

        But it would be interesting to know what you would recommend if I had to compute a checksum over all sections while still keeping the parse logic in each class instead of one central place.

        Well, if by that you mean you want to checksum the entire file, then probably the above sub parsefile is a good place, perhaps devising a way to keep track of the bytes already read or computing the checksum while reading - like for example, an object that wraps the filehandle and exposes the binread method I showed to read from the file could calculate the checksum as the file is read piece by piece. But in my experience it's more common to see checksums on a per-packet basis, in which case, in the above code, Packet::parse could take over the checksum reading and checking.
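
The idea of an object that wraps the filehandle and checksums everything it reads could be sketched roughly like this. The class name ChecksumReader, its method names and the simple 16-bit additive checksum are assumptions made for illustration; a real format would dictate its own algorithm (CRC32 or similar).

```perl
use strict;
use warnings;

# Hypothetical wrapper class: the name and the additive 16-bit checksum
# are assumptions, not something the file format in question prescribes.
package ChecksumReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, sum => 0 }, $class;
}

# Read exactly $bytes bytes, fold them into the running checksum,
# then unpack them with $templ.
sub binread {
    my ($self, $bytes, $templ) = @_;
    my $fh  = $self->{fh};
    my $got = read($fh, my $data, $bytes);
    die "failed to read $bytes bytes" unless defined $got && $got == $bytes;
    # The "%16" prefix makes unpack return a 16-bit sum of the byte values.
    $self->{sum} = ( $self->{sum} + unpack('%16C*', $data) ) & 0xFFFF;
    return unpack($templ, $data);
}

sub checksum { my ($self) = @_; return $self->{sum} }

sub at_eof {
    my ($self) = @_;
    my $fh = $self->{fh};
    return eof($fh);
}

package main;

my $binfile = "\x01\x06Hello\0" . "\x02\x04\0\x12\x34\x56";
open my $fh, '<:raw', \$binfile or die $!;
my $reader = ChecksumReader->new($fh);
until ( $reader->at_eof ) {
    my ($type, $length) = $reader->binread(2, 'CC');
    $reader->binread($length, 'a*');   # a real parser would decode the body here
}
close $fh;

printf "checksum: 0x%04X\n", $reader->checksum;
```

Because every class reads through the same reader object, the parse logic can stay in the individual classes while the checksum is still computed over the whole file.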

        But I want to learn how to solve my tasks with a better design and in a better way.

        * There are a whole bunch of possible variations on the above code. For example, I could've used Moo's features like BUILDARGS to have the constructor do the parsing, instead of a separate sub parse (although the former solution makes it a little more tricky to create packets in code that haven't been parsed from a file). Or, I could have structured the classes differently: If this was like a network protocol and each "packet" has a header, then it would make sense for the class Packet to have a header field that is populated with a corresponding class, instead of having the *Header classes be subclasses of Packet. Or I could've defined a role that requires each class to have a parse method. And so on.

        So in general, the usual software design principles apply: reduce repetition, design your OO "isa" and "has" relationships in a sensible manner, make judicious use of factory methods, and so on. If you feel like something is getting too difficult, then it's best to step back and see if there might be some architecture changes that would help the situation, instead of plowing on, because the more code you write, the more reluctant you'll be to make larger architecture changes.

Re: Best way to a parse a big binary file
by pwagyi (Monk) on Dec 02, 2019 at 08:27 UTC

    I would not recommend reading the whole file into memory, since you said it would be a big binary file :) I would have a class for each type: MainHeader, ExtensionHeader, PartOne, PartTwo, etc. Each class constructor would take binary data as a parameter.

    When you say each section has a field that knows the length, does that mean it is Tag-Length-Value (https://en.wikipedia.org/wiki/Type-length-value)? In that case, you could have the main loop in a parser class and pass each record's binary chunk (sized by the length field) to the appropriate class based on the tag.

    The parser class would be something like an iterator: the client invokes a next() method (or ->() in Perl land) to advance to and get the next record from the file.

    # pseudo code
    # error handling omitted!
    sub parser_factory {
        my $file_path = shift;
        my %options = @_;
        fh = open_file($file_path)
        my $iterator = sub {
            # closure; flag end of data/closing file omitted
            while ( header = read(fh) ) {
                length = get_length(header)
                body = read(fh, length)
                type = get_type(header)
                class = get_record_class(header)  # return class name
                return class->new(body);
            }
        }
        return $iterator
    }
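
A runnable version of that iterator idea might look like the following sketch. The record layout (1-byte tag, 1-byte length, then the body), the make_parser name, the Record base class and the tag-to-class table are all assumptions, since the thread does not specify the actual file format.

```perl
use strict;
use warnings;

# Assumed layout (not specified in the thread): 1-byte tag, 1-byte length,
# then <length> bytes of body. All class and sub names are hypothetical.

package Record;

# Each record class receives the raw body bytes and decodes them itself;
# here the body is simply stored.
sub new {
    my ($class, $body) = @_;
    return bless { body => $body }, $class;
}

package MainHeader;
use parent -norequire, 'Record';

package ExtensionHeader;
use parent -norequire, 'Record';

package main;

my %CLASS_FOR_TAG = ( 0x01 => 'MainHeader', 0x02 => 'ExtensionHeader' );

# Returns a closure; each call yields the next record object,
# or an empty return (undef in scalar context) at end of data.
sub make_parser {
    my ($fh) = @_;
    return sub {
        return if eof($fh);
        ( read($fh, my $header, 2) // 0 ) == 2 or die "truncated header";
        my ($tag, $length) = unpack('CC', $header);
        ( read($fh, my $body, $length) // 0 ) == $length or die "truncated body";
        my $class = $CLASS_FOR_TAG{$tag} or die "unknown tag $tag";
        return $class->new($body);
    };
}

my $data = "\x01\x06Hello\0" . "\x02\x04\0\x12\x34\x56";
open my $fh, '<:raw', \$data or die $!;
my $next = make_parser($fh);
my @seen;
while ( my $rec = $next->() ) {
    push @seen, ref($rec);
    printf "%s: %d bytes\n", ref($rec), length $rec->{body};
}
close $fh;
```

Only one record is held in memory at a time, which is the point of the iterator when the file is large.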
Re: Best way to a parse a big binary file
by Anonymous Monk on Nov 30, 2019 at 01:54 UTC

    Hi

    A scalar variable ain't nothing but a filehandle :) (see open)

    So a filehandle you should read :) but it can be a small sample file that was Data::Dump::dd()ed

    I don't like reading the pack docs, so I'd write helpers like ReadBytes, ReadFloat, ReadUInt64, ...
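
Helpers of that kind could be sketched as thin wrappers around read and unpack, so the pack templates only have to be looked up once. The function names below are hypothetical, and little-endian byte order is an assumption, since the thread does not say which byte order the file uses.

```perl
use strict;
use warnings;

# Hypothetical helpers; little-endian byte order is an assumption.

# Read exactly $n bytes from $fh or die.
sub read_bytes {
    my ($fh, $n) = @_;
    my $got = read($fh, my $buf, $n);
    die "failed to read $n bytes" unless defined $got && $got == $n;
    return $buf;
}

sub read_uint32 { return unpack 'V',  read_bytes($_[0], 4) }
sub read_uint64 { return unpack 'Q<', read_bytes($_[0], 8) }  # needs a perl with 64-bit integers
sub read_float  { return unpack 'f<', read_bytes($_[0], 4) }
sub read_double { return unpack 'd<', read_bytes($_[0], 8) }

# Build a small sample in memory and read it back through a filehandle.
my $sample = pack('V Q< f< d<', 42, 1_099_511_627_776, 1.5, -0.25);
open my $fh, '<:raw', \$sample or die $!;
my $u32 = read_uint32($fh);
my $u64 = read_uint64($fh);
my $flt = read_float($fh);
my $dbl = read_double($fh);
close $fh;

print "$u32 $u64 $flt $dbl\n";
```

The in-memory filehandle stands in for a real file here, which also makes such helpers easy to test against small hand-built samples.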
