Monk::Thomas has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks

I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human-readable data structure (hashes of hashes of arrays of hashes, that kind of thing; data fields that are only relevant for parsing the binary data streams are stripped from the result).

One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the actual value is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. Most of these fields seem to be exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are 2 things that bug me:

My ideas:

One could emulate a '6 Byte Flags' field by reading a uint32 + uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
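A minimal sketch of that uint32 + uint16 approach, for illustration only. It assumes little-endian byte order (which the size fields in the hex dump further down suggest) and a perl built with 64-bit integers; the sample bytes are made up.

```perl
use strict;
use warnings;

# Hypothetical 6-byte flags field, least significant byte first.
my $raw = pack 'C6', 0x04, 0x00, 0x00, 0x00, 0x01, 0x00;

# Read a little-endian uint32 followed by a little-endian uint16.
my ($low32, $high16) = unpack 'V v', $raw;

# Combine both parts; this needs a perl with 64-bit integer support.
my $flags = ($high16 << 32) | $low32;

printf "0x%012X\n", $flags;                      # prints 0x000100000004
print "bit 2 is set\n"  if $flags & (1 << 2);
print "bit 32 is set\n" if $flags & (1 << 32);
```

It works, but the field length is baked into the unpack template, which is probably what makes it feel like a wart.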

Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:

%flags = (
    'is_deleted'    => 0,    # a known flag
    'is_compressed' => 1,    # another known flag
    '2^15'          => 1,    # a bit that is set but unknown
);
(unknown bits with value 0 are not listed in order to conserve space)
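A rough sketch of how such a hash could be built from an unsigned integer. The helper name flags_to_hash and the bit positions in %known are assumptions made up for this example, not part of the library.

```perl
use strict;
use warnings;

# Hypothetical table of known flags: bit index => name.
my %known = ( 2 => 'is_deleted', 6 => 'is_compressed' );

# Turn an unsigned integer into a hash of flags. Known flags are always
# listed; unknown bits are listed only when set, under the key '2^N'.
sub flags_to_hash {
    my ($value, $width) = @_;
    my %result;
    for my $bit (0 .. $width - 1) {
        my $is_set = ($value >> $bit) & 1;
        if (exists $known{$bit}) {
            $result{ $known{$bit} } = $is_set;
        }
        elsif ($is_set) {
            $result{"2^$bit"} = 1;
        }
    }
    return %result;
}

my %flags = flags_to_hash(0x8040, 16);    # bits 6 and 15 are set
print "$_ => $flags{$_}\n" for sort keys %flags;
# 2^15 => 1
# is_compressed => 1
# is_deleted => 0
```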

Your ideas?

Thanks for all your input! I have a bit of trouble deciding whether I should go with

%flags = (
    'is_deleted'    => 0,    # a known flag
    'is_compressed' => 1,    # another known flag
    '2^15'          => 1,    # a bit that is set but unknown
);
or with
  flags => 0b00100010000...
  #            | is_deleted
  #                | is_compressed

because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:

example:
  # name      specification      expected value
  - [ Flags,  'flags_example', ]

flags_example: {
  "length": 4,              # length of flags field (in bits or bytes)
  "2^2": "is_deleted",      # a known flag
  "2^6": "is_compressed",   # another known flag
  ...
}
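A sketch of how a decoder could be driven by such a description, written as a plain Perl hash instead of YAML. The name decode_flags, the choice of bytes for "length", and the sample data are assumptions for illustration. Because it uses vec() on the raw bytes, it is not limited to 1, 2, 4 or 8 bytes.

```perl
use strict;
use warnings;

# Hypothetical spec, mirroring the proposed grammar entry (length in bytes).
my %spec = (
    'length' => 6,
    '2^2'    => 'is_deleted',
    '2^6'    => 'is_compressed',
);

# Decode a raw flags field of any byte length according to a spec.
# vec(..., $bit, 1) counts bits from the least significant bit of the
# first byte, which matches a little-endian integer layout.
sub decode_flags {
    my ($raw, $spec) = @_;
    die "expected $spec->{length} bytes\n"
        unless length($raw) == $spec->{length};
    my %result;
    for my $bit (0 .. 8 * $spec->{length} - 1) {
        my $is_set = vec($raw, $bit, 1);
        my $name   = $spec->{"2^$bit"};
        if    (defined $name) { $result{$name}    = $is_set }
        elsif ($is_set)       { $result{"2^$bit"} = 1 }
    }
    return \%result;
}

# Made-up 6-byte field: bit 6 and bit 47 are set.
my $raw     = pack 'C6', 0x40, 0x00, 0x00, 0x00, 0x00, 0x80;
my $decoded = decode_flags($raw, \%spec);
print "$_ => $decoded->{$_}\n" for sort keys %$decoded;
# 2^47 => 1
# is_compressed => 1
# is_deleted => 0
```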

context

The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats, the parser is configured by a YAML file. A full record description would probably be kinda boring, so here are the hex dump for a value, the parser grammar and the actual parsed data:

hex dump:

4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00

annotated hex dump:
 4B 53 49 5A                           Type           (KSIZ)
 04 00                                 Size           (always 4)
 03 00 00 00                           KwrdCount
 4B 57 44 41                           Type           (KWDA)
 0C 00                                 Size           (4 * KwrdCount)
 98 37 01 00 95 37 01 00 6C 2A 09 00   Keywords       FormID{count}
parser grammar:
example:
  # name        specification       expected value
  - [ type1,    'char[4]',          'x = KSIZ' ]
  - [ size1,    'uint[2]',          'x = 4', 's = 2' ]
  - << size1 begin >>
  - [ count,    'uint[size1]',      'x > 0' ]
  - << size1 end >>
  # -------------------------------------------------------------- #
  - [ type2,    'char[4]',          'x = KWDA' ]
  - [ size2,    'uint[2]', ]
  - << size2 begin >>
  - [ Keywords, 'uint[4]{count}',   'c > 0' ]
  - << size2 end >>

combining hex dump + grammar results in:

...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
...    
(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
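For comparison, a hand-written sketch that decodes this one record with plain unpack. It hardcodes the layout from the annotated hex dump instead of reading it from the grammar, and it assumes little-endian integers, so it is not how the library itself works.

```perl
use strict;
use warnings;

# The example record from the hex dump above.
my $hex = '4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 '
        . '98 37 01 00 95 37 01 00 6C 2A 09 00';
(my $digits = $hex) =~ s/\s+//g;
my $data = pack 'H*', $digits;

# a4 = 4 raw chars, v = little-endian uint16, V = little-endian uint32.
my ($type1, $size1, $count, $rest) = unpack 'a4 v V a*', $data;
die "unexpected record\n" unless $type1 eq 'KSIZ' && $size1 == 4;

# The second sub-record holds $count keywords of 4 bytes each.
my ($type2, $size2, @keywords) = unpack "a4 v V$count", $rest;
die "unexpected record\n" unless $type2 eq 'KWDA' && $size2 == 4 * $count;

printf "Keywords: 0x%08X\n", $_ for @keywords;
# Keywords: 0x00013798
# Keywords: 0x00013795
# Keywords: 0x00092A6C
```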

How to read the parser grammar:

not shown: sub records, alternatives, repeating records, ...

I'm pretty sure this library will end up on CPAN some day; for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse.)

Replies are listed 'Best First'.
Re: Looking for ideas: Converting a binary 'flags' field into something human readable
by BrowserUk (Patriarch) on Jul 07, 2015 at 21:39 UTC
    Your ideas?

    If 8 bytes is the longest field, I think I'd be tempted to display the binary and annotate only those known fields something like this:

    flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
    #            | compressed
    #               | deleted
    #                | this
    #                     | that
    #                           | other
    #                             | something else
    #                                       | and another
    #                                                  foo |
    #                                                    bar |
    #                                                                     up |
    #                                                                    down |
    #                                                                 sideways |

    Not pretty, but very clear.
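A small sketch of how such an annotated dump could be generated. The function name annotate_flags and the flag table are hypothetical; it always puts the label to the right of the marker and only handles values that fit in a native integer.

```perl
use strict;
use warnings;

# Render a flags value as a binary literal, followed by one comment line
# per known flag with a '|' placed under the bit it describes.
# %$names maps a bit index (0 = least significant) to a hypothetical name.
sub annotate_flags {
    my ($label, $value, $width, $names) = @_;
    my $prefix = "$label => 0b";
    my @lines  = ( $prefix . sprintf('%0*b', $width, $value) . ';' );
    for my $bit (sort { $b <=> $a } keys %$names) {
        # The most significant bit is printed first (leftmost).
        my $column = length($prefix) + $width - 1 - $bit;
        push @lines, '#' . ' ' x ($column - 1) . "| $names->{$bit}";
    }
    return @lines;
}

print "$_\n" for annotate_flags(
    'flags1', 0b01000100, 8, { 2 => 'is_deleted', 6 => 'is_compressed' },
);
# flags1 => 0b01000100;
# #            | is_compressed
# #                | is_deleted
```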


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Or?

      flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
      #            |  ||    |that | |         |and another   | |bar          up|||
      #                |this        |something else          |foo           down|
      #               |deleted    | other                                sideways|
      #            |compressed

        I'd like to see the code that determines how to compress those together :)
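One possible take on that, as a simplified sketch rather than a reproduction of the hand-made layout above: every label is written as "|name" starting in the column of its bit and is placed greedily into the first row where it does not collide with an earlier label. The function name pack_labels and the sample names are made up.

```perl
use strict;
use warnings;

# Greedy row packing: each "|name" label starts in the column of its bit
# and goes into the first row whose text ends before that column.
# %$names maps a bit index (0 = least significant) to a hypothetical name.
sub pack_labels {
    my ($width, $names) = @_;
    my @rows;
    for my $bit (sort { $b <=> $a } keys %$names) {    # left to right
        my $column = $width - 1 - $bit;
        # First row that leaves at least one space before this column.
        my ($i) = grep { length($rows[$_]) < $column } 0 .. $#rows;
        $i = @rows unless defined $i;
        $rows[$i] = '' unless defined $rows[$i];
        $rows[$i] .= ' ' x ($column - length($rows[$i])) . "|$names->{$bit}";
    }
    return @rows;
}

printf "%016b\n", 0b1001000000001000;
print "$_\n" for pack_labels(
    16, { 15 => 'a_long_name', 12 => 'deleted', 3 => 'x' },
);
# 1001000000001000
# |a_long_name
#    |deleted |x
```

Putting some labels to the left of their marker, as in the layout above, would need an extra pass.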



      That seems to me to be an “at least fairly genius” suggestion: to encode the value as the bit-field which it is, but then to document the meaning of the field for the human reader. (The computer wouldn’t care.)

      Promptly upvoted.

      The parser-grammar could fairly easily be made to include a descriptor that, one way or another, maps to a (hard-coded) Perl subroutine within the parser ... which generates the comment-entries to describe the field.   (Personally, I think I’d do it as a table, including the starting byte/bit number, the number of bits, and the interpretation.   “ASCII Art” could be unmanageable.)
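A sketch of the table idea, assuming a hypothetical field list with start bit, number of bits and interpretation. The 4-bit 'priority' field is invented to show a multi-bit entry.

```perl
use strict;
use warnings;

# Hypothetical field table: start bit, number of bits, interpretation.
my @fields = (
    [ 2, 1, 'is_deleted'    ],
    [ 6, 1, 'is_compressed' ],
    [ 8, 4, 'priority'      ],
);

# Extract each field from the value and return one table row per field.
sub describe_flags {
    my ($value, @table) = @_;
    return map {
        my ($start, $bits, $name) = @$_;
        my $field = ($value >> $start) & ((1 << $bits) - 1);
        sprintf '%5d  %4d  %5d  %s', $start, $bits, $field, $name;
    } @table;
}

printf "%5s  %4s  %5s  %s\n", 'start', 'bits', 'value', 'meaning';
print "$_\n" for describe_flags(0x0A40, @fields);
```

For 0x0A40 this reports is_deleted as 0, is_compressed as 1 and priority as 10.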

Re: Looking for ideas: Converting a binary 'flags' field into something human readable
by bitingduck (Deacon) on Jul 10, 2015 at 00:37 UTC

    You've already got some code posted and a few suggestions, but when I had to do this a few months ago for a known number of bits, I used a D/A board and LEDs.

    :D