Looking for ideas: Converting a binary 'flags' field into something human readable

Monk::Thomas has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks

I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human readable data structure. (hashes of hashes of array of hashes kinda thing; data fields that are only relevant for parsing the binary data streams are stripped from the result)

One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the actual value is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. They mostly seem to be exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are two things that bug me:

What should I do if a flags field turns up whose length does not match an integer type (e.g. 6 or 10 bytes)?
Is there a better way to represent the 'flag/bit' nature of the value?

My ideas:

One could emulate a '6-byte flags' field by reading a uint32 plus a uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
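
A minimal sketch of that uint32 + uint16 idea, assuming the field is little-endian like the sizes in the hex dump further down. The name read_flags6 is made up for illustration and is not part of the library:

```perl
use strict;
use warnings;

# Hypothetical helper: read a 6-byte little-endian flags field as
# uint32 + uint16 and combine both halves into one integer.
sub read_flags6 {
    my ($bytes) = @_;
    die "expected 6 bytes\n" unless length($bytes) == 6;
    my ($low, $high) = unpack 'V v', $bytes;    # V = uint32 LE, v = uint16 LE
    # Multiply instead of shifting so this does not depend on 64-bit
    # integer support; 48 bits still fit exactly into a double.
    return $low + $high * 4294967296;           # 4294967296 = 2**32
}

my $value = read_flags6("\x01\x00\x00\x00\x01\x00");
print "$value\n";                               # 2**32 + 1
```

This works, but it needs one special case per odd length, which is probably why it feels like a wart.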

Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:

%flags = (
    'is_deleted'    => 0,       # a known flag
    'is_compressed' => 1,       # another known flag
    '2^15'          => 1,       # a bit that is set but unknown
);

(unknown bits with value 0 are not listed in order to conserve space)
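
A rough sketch of how such a hash could be produced for a field of any byte length, again assuming little-endian data. The bit positions (2 and 6) are borrowed from the grammar sketch in the update below; decode_flags and %FLAG_NAME are made-up names:

```perl
use strict;
use warnings;

# Assumed bit assignments (bit index => flag name), for illustration only.
my %FLAG_NAME = ( 2 => 'is_deleted', 6 => 'is_compressed' );

# Turn a little-endian flags field of any byte length into a hash.
sub decode_flags {
    my ($bytes, $names) = @_;
    # 'b*' yields one character per bit, least significant bit of each
    # byte first, so string position N corresponds to the bit 2^N.
    my @bits = split //, unpack 'b*', $bytes;
    my %flags;
    for my $n (0 .. $#bits) {
        if (exists $names->{$n}) {
            $flags{ $names->{$n} } = $bits[$n] + 0;   # known flag: always listed
        }
        elsif ($bits[$n]) {
            $flags{"2^$n"} = 1;                       # unknown flag: only if set
        }
    }
    return \%flags;
}

my $flags = decode_flags("\x40\x80\x00\x00\x00\x00", \%FLAG_NAME);   # 6 bytes
print "$_ => $flags->{$_}\n" for sort keys %$flags;
# lists 2^15 => 1, is_compressed => 1 and is_deleted => 0
```

Because unpack 'b*' works on a byte string rather than an integer, a 6 or 10 byte field needs no special handling here.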

Your ideas?

Thanks for all your input! I have a bit of trouble deciding whether I should go with

%flags = (
    'is_deleted'    => 0,       # a known flag
    'is_compressed' => 1,       # another known flag
    '2^15'          => 1,       # a bit that is set but unknown
);

or with

  flags => 0b00100010000...
  #            | is_deleted
  #                | is_compressed

because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:

example:
  # name                specification       expected value
  - [ Flags,            'flags_example',                           ]

flags_example:
  {
    "length": 4,                # length of flags field (in bits or bytes)
    "2^2": "is_deleted",        # a known flag
    "2^6": "is_compressed",     # another known flag
    ...
  }
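
A sketch of how such a spec might be used for the opposite direction, turning a flags hash back into bytes. It assumes "length" is given in bytes and the data is little-endian; encode_flags is a hypothetical name:

```perl
use strict;
use warnings;

# The spec from above as a Perl hash; 'length' is assumed to be in bytes.
my %spec = ( 'length' => 4, '2^2' => 'is_deleted', '2^6' => 'is_compressed' );

# Serialize a flags hash back into its little-endian byte string.
sub encode_flags {
    my ($flags, $spec) = @_;
    my %bit_of;                                   # flag name => bit index
    for my $key (keys %$spec) {
        $bit_of{ $spec->{$key} } = $1 if $key =~ /^2\^(\d+)$/;
    }
    my @bits = (0) x ($spec->{length} * 8);
    for my $key (keys %$flags) {
        my $n = exists $bit_of{$key}    ? $bit_of{$key}
              : $key =~ /^2\^(\d+)$/    ? $1          # unknown bit, e.g. '2^15'
              : die "unknown flag '$key'\n";
        die "bit $n does not fit into $spec->{length} bytes\n" if $n > $#bits;
        $bits[$n] = $flags->{$key} ? 1 : 0;
    }
    return pack 'b*', join '', @bits;             # position N becomes bit 2^N
}

my $bytes = encode_flags({ is_deleted => 0, is_compressed => 1, '2^15' => 1 }, \%spec);
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes), "\n";   # 40 80 00 00
```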

context

The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats the parser is configurable by a YAML-file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:

hex dump:

4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00

annotated hex dump:

 4B 53 49 5A                           Type           (KSIZ)
 04 00                                 Size           (always 4)
 03 00 00 00                           KwrdCount
 4B 57 44 41                           Type           (KWDA)
 0C 00                                 Size           (4 * KwrdCount)
 98 37 01 00 95 37 01 00 6C 2A 09 00   Keywords       FormID{count}

parser grammar:

example:
  # name                specification       expected value
  - [ type1,            'char[4]',          'x = KSIZ'             ]
  - [ size1,            'uint[2]',          'x = 4', 's = 2'       ]
  - << size1 begin >>
  - [ count,            'uint[size1]',      'x > 0'                ]
  - << size1 end >>
  # -------------------------------------------------------------- #
  - [ type2,            'char[4]',          'x = KWDA'             ]
  - [ size2,            'uint[2]',                                 ]
  - << size2 begin >>
  - [ Keywords,         'uint[4]{count}',   'c > 0'                ]
  - << size2 end >>

combining hex dump + grammar results in:

...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
...

(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
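
For comparison, here is a hand-written sketch of what the grammar does for this one record, using plain pack/unpack. It hardcodes the layout instead of reading the YAML, so it only illustrates the data flow and is not how the library works:

```perl
use strict;
use warnings;

# The hex dump from above, turned back into raw bytes.
my $hex = '4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 '
        . '98 37 01 00 95 37 01 00 6C 2A 09 00';
(my $packed = $hex) =~ s/\s+//g;
my $data = pack 'H*', $packed;

# a4 = char[4], v = uint16 LE, V = uint32 LE; 'a*' keeps the remaining bytes.
my ($type1, $size1, $count, $type2, $size2, $rest)
    = unpack 'a4 v V a4 v a*', $data;

# The 'expected value' column, checked by hand.
die "unexpected record layout\n"
    unless $type1 eq 'KSIZ' && $size1 == 4 && $count > 0
        && $type2 eq 'KWDA' && $size2 == 4 * $count
        && length($rest) == $size2;

# Only the capitalized name ends up in the result set.
my %example = ( Keywords => [ unpack "V$count", $rest ] );
printf "%08X\n", $_ for @{ $example{Keywords} };
# prints 00013798, 00013795 and 00092A6C, one per line
```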

How to read the parser grammar:

This parser grammar is written in YAML. (Actually the only reason for YAML is the ability to use comments. Strip the comments and it's JSON.)
lines beginning with # are comments and are only provided for documentational purpose
lines beginning with - indicate a parseable item
Square bracketed lines indicate a value to read from the data stream. The first column is a suitable value name, the second is the actual binary data format, and the third is optional and (if present) denotes one or more conditions that must be met for the value to be valid.
Value names matching qr/^[a-z\d]+$/ are relevant only during parsing and are not part of the final result set. If the parsed data needs to be serialized into a data stream again, these values are either calculated from the input value (3 'Keywords' => count = 3) or, if they are required to have a certain value, taken from the 'expected value' column (type1 = 'KSIZ').
all other values are part of the returned parser result.
'<< (\w+) (begin|end) >>' signifies the begin and end for calculating the relevant size value. (Nasty: the length of 'size' itself may be part of the actual value.)
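
To make the second column a bit more concrete, here is a sketch of how a specification string such as 'uint[4]{count}' could be split into its parts. parse_spec and the returned keys are made up for illustration; the real parser may do this differently, and a 'Flags' entry would need its own handling:

```perl
use strict;
use warnings;

# Hypothetical helper: split a specification string into its parts.
# Width and repeat count are either a literal number or the name of a
# previously parsed value (e.g. 'size1', 'count').
sub parse_spec {
    my ($spec) = @_;
    $spec =~ /^ (?<type>[a-z]+) \[ (?<width>\w+) \] (?: \{ (?<repeat>\w+) \} )? $/x
        or die "cannot parse specification '$spec'\n";
    return +{
        type   => $+{type},
        width  => $+{width},
        repeat => $+{repeat} // 1,     # no {...} part means a single value
    };
}

for my $spec ('char[4]', 'uint[size1]', 'uint[4]{count}') {
    my $p = parse_spec($spec);
    print "${spec}: type=$p->{type} width=$p->{width} repeat=$p->{repeat}\n";
}
```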

not shown: sub records, alternatives, repeating records, ...

I'm pretty sure this library will end up on CPAN some day, for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)


Re: Looking for ideas: Converting a binary 'flags' field into something human readable by BrowserUk (Patriarch) on Jul 07, 2015 at 21:39 UTC
Your ideas?

If 8 bytes is the longest field, I think I'd be tempted to display the binary and annotate only those known fields, something like this:

flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
#            | compressed
#               | deleted
#                | this
#                     | that
#                           | other
#                             | something else
#                                       | and another
#                                                  foo |
#                                                    bar |
#                                                                     up |
#                                                                    down |
#                                                                 sideways |

Not pretty, but very clear.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
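
The one-flag-per-comment-line layout suggested here could be generated along these lines. This is only a sketch: annotate_flags is a made-up name, the flag positions are the ones from the original post's grammar example, and it assumes little-endian data printed most significant bit first so that the literal is a valid number:

```perl
use strict;
use warnings;

# Render a flags field as a binary literal followed by one comment line
# per known flag. $names maps a bit index N (for 2^N) to a label.
sub annotate_flags {
    my ($field, $bytes, $names) = @_;
    # 'b*' lists the least significant bit first; reverse the string so
    # the highest bit comes first, as in a normal binary literal.
    my $bits   = scalar reverse unpack 'b*', $bytes;
    my $prefix = "$field => 0b";
    my @lines  = ("$prefix$bits;");
    for my $n (sort { $b <=> $a } keys %$names) {      # leftmost flag first
        # Column of bit N in the literal line (column 0 holds the '#').
        my $column = length($prefix) + length($bits) - 1 - $n;
        push @lines, '#' . ' ' x ($column - 1) . "| $names->{$n}";
    }
    return join "\n", @lines;
}

print annotate_flags('flags', "\x44\x80",
                     { 2 => 'is_deleted', 6 => 'is_compressed' }), "\n";
```

Dropping the reverse (and counting columns from the left) would give the least-significant-bit-first order used in the original post's own example.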
Re^2: Looking for ideas: Converting a binary 'flags' field into something human readable by bojinlund (Monsignor) on Jul 08, 2015 at 06:00 UTC
Or?

flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
#            |  ||    |that | |         |and another   | |bar          up|||
#                |this        |something else          |foo           down|
#               |deleted    | other                                sideways|
#            |compressed
Re^3: Looking for ideas: Converting a binary 'flags' field into something human readable by BrowserUk (Patriarch) on Jul 08, 2015 at 06:47 UTC
I'd like to see the code that determines how to compress those together :)
Re^4: Looking for ideas: Converting a binary 'flags' field into something human readable by choroba (Cardinal) on Jul 08, 2015 at 10:29 UTC
Re^5: Looking for ideas: Converting a binary 'flags' field into something human readable by BrowserUk (Patriarch) on Jul 08, 2015 at 11:21 UTC
Re^5: Looking for ideas: Converting a binary 'flags' field into something human readable by Monk::Thomas (Friar) on Jul 08, 2015 at 10:47 UTC
Re^4: Looking for ideas: Converting a binary 'flags' field into something human readable by GotToBTru (Prior) on Jul 08, 2015 at 15:17 UTC
Re^2: Looking for ideas: Converting a binary 'flags' field into something human readable by locked_user sundialsvc4 (Abbot) on Jul 09, 2015 at 00:47 UTC
That seems to me to be an “at least fairly genius” suggestion: to encode the value as the bit-field which it is, but then to document the meaning of the field to the human reader. (The computer wouldn’t care.) Promptly upvoted. The parser-grammar could fairly easily be made to include a descriptor that, one way or another, maps to a (hard-coded) Perl subroutine within the parser ... which generates the comment-entries to describe the field. (Personally, I think I’d do it as a table, including the starting byte/bit number, the number of bits, and the interpretation. “ASCII Art” could be unmanageable.)
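
A sketch of that table idea, with each row giving the starting bit, the number of bits, and the interpretation. The rows are assumptions for illustration ('priority' in particular is an invented multi-bit field), and describe_flags is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical field table: [ starting bit, number of bits, interpretation ].
my @table = (
    [ 2, 1, 'is_deleted'    ],
    [ 6, 1, 'is_compressed' ],
    [ 8, 4, 'priority'      ],     # invented multi-bit field
);

sub describe_flags {
    my ($bytes, $table) = @_;
    my $bits = unpack 'b*', $bytes;      # position N in $bits is the bit 2^N
    my @rows = (sprintf '%-5s %-4s %-5s %s', 'start', 'bits', 'value', 'meaning');
    for my $row (@$table) {
        my ($start, $width, $meaning) = @$row;
        # The slice is least significant bit first, so reverse it before
        # reading it as a binary number.
        my $value = oct('0b' . scalar reverse substr($bits, $start, $width));
        push @rows, sprintf '%-5d %-4d %-5d %s', $start, $width, $value, $meaning;
    }
    return join "\n", @rows;
}

print describe_flags("\x40\x05", \@table), "\n";
# the value column reads 0, 1 and 5 for the three rows
```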
Re: Looking for ideas: Converting a binary 'flags' field into something human readable by bitingduck (Deacon) on Jul 10, 2015 at 00:37 UTC
You've already got some code posted and a few suggestions, but when I had to do this a few months ago for a known number of bits, I used a D/A board and LEDs. :D