Monk::Thomas has asked for the wisdom of the Perl Monks concerning the following question:
Hello fellow monks
I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human readable data structure. (hashes of hashes of array of hashes kinda thing; data fields that are only relevant for parsing the binary data streams are stripped from the result)
One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the actual value is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. It seems like they are mostly exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However there are 2 things that bug me:
My ideas:
One could emulate a '6 Byte Flags' field by reading uint32 + uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
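A minimal sketch of that uint32 + uint16 combination, assuming the field is little-endian (as the hex dump further down suggests) and that perl was built with 64-bit integers. The sub name `flags48_to_int` is made up for illustration.

```perl
use strict;
use warnings;

# Combine a 6-byte flags field into a single integer.
# Assumptions: little-endian byte order, perl with 64-bit integer support.
sub flags48_to_int {
    my ($raw) = @_;                         # exactly 6 bytes
    my ($low, $high) = unpack 'V v', $raw;  # uint32 LE, then uint16 LE
    return $low + ($high << 32);            # high part supplies bits 32..47
}

my $raw = pack 'C6', 0x01, 0x00, 0x00, 0x00, 0x00, 0x80;  # bits 0 and 47 set
printf "0x%012X\n", flags48_to_int($raw);                 # prints 0x800000000001
```

This only works while the field fits into a native integer, so anything longer than 8 bytes would still need a different representation.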
Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:
    %flags = (
        'is_deleted'    => 0,  # a known flag
        'is_compressed' => 1,  # another known flag
        '2^15'          => 1,  # a bit that is set but unknown
    );
(unknown bits with value 0 are not listed in order to conserve space)
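A sketch of how such a hash could be built from the raw bytes, whatever their length. It assumes a little-endian field and perl 5.10+ (for `//`). The sub name `flags_to_hash` and the bit positions in `%known` are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical name table: bit index => flag name
my %known = ( 2 => 'is_deleted', 6 => 'is_compressed' );

sub flags_to_hash {
    my ($raw, $names) = @_;
    # 'b*' yields one character per bit, lowest bit of each byte first,
    # so string position N corresponds to 2^N of a little-endian field.
    my @bits  = split //, unpack 'b*', $raw;
    my %flags = map { $_ => 0 } values %$names;   # known flags are always listed
    for my $i (0 .. $#bits) {
        next unless $bits[$i];
        $flags{ $names->{$i} // "2^$i" } = 1;     # unknown set bits become '2^N'
    }
    return \%flags;
}

my $flags = flags_to_hash(pack('v', 0x8040), \%known);
# $flags now holds: is_deleted => 0, is_compressed => 1, '2^15' => 1
```

Because the bytes are never turned into a number, the same code handles 1, 6 or 20 byte fields alike.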
Your ideas?
Thanks for all your input! I have a bit of trouble deciding whether I should go with
    %flags = (
        'is_deleted'    => 0,  # a known flag
        'is_compressed' => 1,  # another known flag
        '2^15'          => 1,  # a bit that is set but unknown
    );
or with
    flags => 0b00100010000...
    #            | is_deleted
    #                | is_compressed
because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:
    example:
      # name     specification     expected value
      - [ Flags, 'flags_example',  ]

    flags_example: {
      "length": 4,             # length of flags field (in bits or bytes)
      "2^2": "is_deleted",     # a known flag
      "2^6": "is_compressed",  # another known flag
      ...
    }
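A sketch of how a spec like this might be used to query a single flag by name. It assumes `length` is given in bytes, the field is little-endian, and the spec has already been loaded from YAML into a plain hash. `flag_index` and `has_flag` are hypothetical names, not part of the library.

```perl
use strict;
use warnings;

# Hypothetical in-memory form of the 'flags_example' spec
my %spec = (
    length => 4,                 # assumed to mean bytes here
    '2^2'  => 'is_deleted',
    '2^6'  => 'is_compressed',
);

# Turn the "2^N" keys into a flag name => bit index lookup
sub flag_index {
    my ($spec) = @_;
    my %index;
    for my $key (keys %$spec) {
        next unless $key =~ /^2\^(\d+)$/;
        $index{ $spec->{$key} } = $1;
    }
    return \%index;
}

sub has_flag {
    my ($raw, $spec, $name) = @_;
    die "expected $spec->{length} bytes\n" unless length($raw) == $spec->{length};
    my $bit = flag_index($spec)->{$name};
    die "unknown flag '$name'\n" unless defined $bit;
    return vec($raw, $bit, 1);   # bit N of a byte string, low bit of byte 0 first
}

my $raw = pack 'V', 0b0100_0000;   # only bit 6 set
print has_flag($raw, \%spec, 'is_compressed') ? "compressed\n" : "plain\n";
```

`vec` works on the byte string directly, so the field length is not tied to the integer size.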
context
The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats the parser is configurable by a YAML-file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:
hex dump:
    4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00
annotated hex dump:
    4B 53 49 5A                            Type (KSIZ)
    04 00                                  Size (always 4)
    03 00 00 00                            KwrdCount
    4B 57 44 41                            Type (KWDA)
    0C 00                                  Size (4 * KwrdCount)
    98 37 01 00 95 37 01 00 6C 2A 09 00    Keywords FormID{count}
parser grammar:
    example:
      # name        specification       expected value
      - [ type1,    'char[4]',          'x = KSIZ' ]
      - [ size1,    'uint[2]',          'x = 4', 's = 2' ]
      - << size1 begin >>
      - [ count,    'uint[size1]',      'x > 0' ]
      - << size1 end >>
      # -------------------------------------------------------------- #
      - [ type2,    'char[4]',          'x = KWDA' ]
      - [ size2,    'uint[2]',          ]
      - << size2 begin >>
      - [ Keywords, 'uint[4]{count}',   'c > 0' ]
      - << size2 end >>
combining hex dump + grammar results in:
    ...
    example => {
        Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
    ...
(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
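For comparison, a hand-written sketch of what the grammar describes for this one record, using plain `unpack`. It assumes little-endian integers and hardcodes `count` as a 4-byte field, since the grammar requires size1 to be 4. `parse_example` is a made-up name, not the library's API.

```perl
use strict;
use warnings;

# The bytes from the hex dump above
my $record = pack 'H*', join '', qw(
    4B53495A 0400 03000000
    4B574441 0C00 98370100 95370100 6C2A0900
);

sub parse_example {
    my ($bytes) = @_;
    # a4 = char[4], v = uint[2] little-endian, V = uint[4] little-endian
    my ($type1, $size1, $count, $rest) = unpack 'a4 v V a*', $bytes;
    die "expected KSIZ record\n"
        unless $type1 eq 'KSIZ' && $size1 == 4 && $count > 0;

    my ($type2, $size2, @keywords) = unpack "a4 v V$count", $rest;
    die "expected KWDA record\n"
        unless $type2 eq 'KWDA' && $size2 == 4 * $count;

    return { Keywords => \@keywords };   # integer values, not hex strings
}

my $parsed = parse_example($record);
printf "%08X\n", $_ for @{ $parsed->{Keywords} };
# prints 00013798, 00013795 and 00092A6C, one per line
```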
How to read the parser grammar:
not shown: sub records, alternatives, repeating records, ...
I'm pretty sure this library will end up on CPAN some day, for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)