Hello fellow monks

I wrote a parser library for a specific class of binary files (resource files for a video game). It converts a file into a human-readable data structure: hashes of hashes of arrays of hashes, that kind of thing. Data fields that are only relevant for parsing the binary data stream are stripped from the result.

One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the numeric value is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. Most flag fields seem to be exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are 2 things that bug me:

My ideas:

One could emulate a '6 Byte Flags' field by reading a uint32 plus a uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
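For illustration, here is a minimal sketch of that uint32 + uint16 approach, next to a length-independent alternative based on `unpack 'b*'`. The helper names `read_flags48` and `flag_bits` are made up for this example, and little-endian byte order is an assumption based on the sample dump further down.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a little-endian uint32 and uint16 into one
# 48-bit integer. Little-endian byte order is an assumption.
sub read_flags48 {
    my ($bytes) = @_;
    die "need exactly 6 bytes\n" unless length($bytes) == 6;
    my ($low, $high) = unpack 'V v', $bytes;   # V = uint32 LE, v = uint16 LE
    return $high * 4294967296 + $low;          # shift the high word by 32 bits
}

# Hypothetical helper: treat a field of any length as a string of '0'/'1'.
# With 'b*' the character at index N corresponds to bit 2^N of a
# little-endian field, so no integer arithmetic is needed at all.
sub flag_bits {
    my ($bytes) = @_;
    return unpack 'b*', $bytes;
}
```

The second variant sidesteps the odd-length problem entirely, at the cost of no longer having a single integer value to compare against.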

Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:

%flags = (
  'is_deleted'    => 0, # a known flag
  'is_compressed' => 1, # another known flag
  '2^15'          => 1, # a bit that is set but unknown
);
(unknown bits with value 0 are not listed in order to conserve space)
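A rough sketch of how such a hash could be built from the raw bytes. `decode_flags` and the bit-index-to-name table are hypothetical, and the field is assumed to be little-endian so that index N in the unpacked bit string is bit 2^N.

```perl
use strict;
use warnings;

# Hypothetical decoder: turn a raw flags field into a hash.
# $names maps a bit index to a flag name, e.g. { 2 => 'is_deleted' }.
sub decode_flags {
    my ($bytes, $names) = @_;
    my @bits = split //, unpack('b*', $bytes);   # index N holds bit 2^N
    my %flags;
    for my $n (0 .. $#bits) {
        if (exists $names->{$n}) {
            # known flags are always listed, set or not
            $flags{ $names->{$n} } = $bits[$n] + 0;
        }
        elsif ($bits[$n]) {
            # unknown flags are listed only when set, to conserve space
            $flags{"2^$n"} = 1;
        }
    }
    return \%flags;
}

# Example: bits 6 and 15 set, bit 2 clear (bit positions are made up).
my $flags = decode_flags("\x40\x80", { 2 => 'is_deleted', 6 => 'is_compressed' });
```

With that input `$flags` holds the three entries from the hash above: `is_deleted` is 0, `is_compressed` is 1 and `2^15` is 1.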

Your ideas?

Thanks for all your input! I'm having a bit of trouble deciding whether I should go with

%flags = (
  'is_deleted'    => 0, # a known flag
  'is_compressed' => 1, # another known flag
  '2^15'          => 1, # a bit that is set but unknown
);
or with
  flags => 0b00100010000...
  #            | is_deleted
  #                | is_compressed

because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:

example:
  # name     specification      expected value
  - [ Flags, 'flags_example',                  ]

flags_example: {
  "length": 4,             # length of flags field (in bits or bytes)
  "2^2": "is_deleted",     # a known flag
  "2^6": "is_compressed",  # another known flag
  ...
}
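A sketch of how such a spec might be consumed once the YAML has been loaded into a Perl hash. This is not the library's API: `compile_flag_spec` and `read_named_flags` are hypothetical names, `length` is assumed to be given in bytes, and the field is assumed to be little-endian.

```perl
use strict;
use warnings;

# Hypothetical spec mirroring flags_example; 'length' assumed to be bytes.
my %flags_example = (
    'length' => 4,
    '2^2'    => 'is_deleted',
    '2^6'    => 'is_compressed',
);

# Split a spec into the field length and a bit-index => name table.
sub compile_flag_spec {
    my ($spec) = @_;
    my %names;
    for my $key (keys %$spec) {
        next if $key eq 'length';
        my ($bit) = $key =~ /\A2\^(\d+)\z/
            or die "unrecognised key '$key' in flag spec\n";
        die "bit $bit does not fit into $spec->{length} byte(s)\n"
            if $bit >= 8 * $spec->{length};
        $names{$bit} = $spec->{$key};
    }
    return ($spec->{length}, \%names);
}

# Look up every known flag in a raw field. vec() with a width of 1 uses
# the same ascending bit order as unpack 'b*'.
sub read_named_flags {
    my ($spec, $bytes) = @_;
    my ($length, $names) = compile_flag_spec($spec);
    die "expected $length byte(s)\n" unless length($bytes) == $length;
    my %flags;
    for my $bit (keys %$names) {
        $flags{ $names->{$bit} } = vec($bytes, $bit, 1);
    }
    return \%flags;
}
```

Checking the bit positions against the declared length while the spec is compiled catches typos in the grammar before any file is parsed.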

context

The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats the parser is configurable by a YAML-file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:

hex dump:

4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00

annotated hex dump:
 4B 53 49 5A                           Type           (KSIZ)
 04 00                                 Size           (always 4)
 03 00 00 00                           KwrdCount
 4B 57 44 41                           Type           (KWDA)
 0C 00                                 Size           (4 * KwrdCount)
 98 37 01 00 95 37 01 00 6C 2A 09 00   Keywords       FormID{count}
parser grammar:
example:
  # name        specification      expected value
  - [ type1,    'char[4]',         'x = KSIZ'        ]
  - [ size1,    'uint[2]',         'x = 4', 's = 2'  ]
  - << size1 begin >>
  - [ count,    'uint[size1]',     'x > 0'           ]
  - << size1 end >>
  # -------------------------------------------------------------- #
  - [ type2,    'char[4]',         'x = KWDA'        ]
  - [ size2,    'uint[2]',                           ]
  - << size2 begin >>
  - [ Keywords, 'uint[4]{count}',  'c > 0'           ]
  - << size2 end >>

combining hex dump + grammar results in:

...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
...    
(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
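To make the mapping from grammar to bytes concrete, here is a hand-written sketch that parses the sample value with plain `unpack`. It is not how the library works internally; `parse_example` is a hypothetical name, the checks are a loose reading of the 'expected value' column, and little-endian integers are assumed (the `04 00` size field suggests it).

```perl
use strict;
use warnings;

# The sample value from the hex dump, as raw bytes.
my $sample = pack 'H*', join '', qw(
    4B 53 49 5A 04 00 03 00 00 00
    4B 57 44 41 0C 00
    98 37 01 00 95 37 01 00 6C 2A 09 00
);

# Hypothetical hand-rolled equivalent of the 'example' grammar.
sub parse_example {
    my ($data) = @_;
    # a4 = char[4], v = uint[2] LE, V = uint[4] LE, a* = remaining bytes
    my ($type1, $size1, $count, $type2, $size2, $rest)
        = unpack 'a4 v V a4 v a*', $data;
    die "expected KSIZ\n"        unless $type1 eq 'KSIZ';
    die "size1 must be 4\n"      unless $size1 == 4;
    die "count must be > 0\n"    unless $count > 0;
    die "expected KWDA\n"        unless $type2 eq 'KWDA';
    die "size2 != 4 * count\n"   unless $size2 == 4 * $count;
    die "truncated keywords\n"   unless length($rest) >= $size2;
    my @keywords = unpack "V$count", $rest;   # uint[4]{count}
    return { Keywords => \@keywords };
}

my $record = parse_example($sample);
printf "%08X\n", $_ for @{ $record->{Keywords} };
```

Only `Keywords` survives in the result; the type, size and count fields are used for validation and then dropped, matching the stripped output shown above.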

How to read the parser grammar:

not shown: sub records, alternatives, repeating records, ...

I'm pretty sure this library will end up on CPAN some day, for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)


In reply to Looking for ideas: Converting a binary 'flags' field into something human readable by Monk::Thomas
