
Hello fellow monks

I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human readable data structure. (hashes of hashes of array of hashes kinda thing; data fields that are only relevant for parsing the binary data streams are stripped from the result)

One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the numeric value itself is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. The fields seem to be mostly exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are 2 things that bug me:

What to do if a flags field turns up whose length does not match an integer size (e.g. 6 or 10 bytes)?
Is there a better way to represent the 'flag/bit' nature of the value?

My ideas:

One could emulate a '6 byte flags' field by reading a uint32 + a uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
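To make the kludge concrete, here is a minimal sketch of both routes for a 6-byte field. The sample bytes are made up for illustration, and little-endian byte order is an assumption (it matches the integers in the hex dump further down). Option B sidesteps integers entirely, so it would also cope with 10 bytes.

```perl
use strict;
use warnings;

# Hypothetical 6-byte flags field (made-up sample data, little-endian assumed).
my $raw = pack 'H*', '040000000080';

# Option A: read uint32 + uint16 and combine them by hand.
# 48 bits still fit exactly into a double, so this works without 64-bit integers.
my ($lo, $hi) = unpack 'V v', $raw;
my $value = $hi * 2**32 + $lo;

# Option B: skip the integer and ask for the bits directly.
# 'b*' yields one character per bit, lowest bit of the first byte first,
# so the character at index N corresponds to bit 2^N.
my $bits = unpack 'b*', $raw;
my @set  = grep { substr($bits, $_, 1) } 0 .. length($bits) - 1;

print "value:    $value\n";
print "set bits: @set\n";    # set bits: 2 47
```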

Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:

%flags = (
    'is_deleted'    => 0,       # a known flag
    'is_compressed' => 1,       # another known flag
    '2^15'          => 1,       # a bit that is set but unknown
);

(unknown bits with value 0 are not listed in order to conserve space)
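A rough sketch of how such a hash could be built from the raw bytes, for any field length. The bit positions for is_deleted and is_compressed are placeholders (not taken from a real file format), decode_flags is a name I made up, and little-endian bit numbering is assumed.

```perl
use strict;
use warnings;

# Hypothetical bit assignments; the real positions depend on the file format.
my %name_of_bit = ( 2 => 'is_deleted', 6 => 'is_compressed' );

# Turn a raw flags field of any length into a hash of flag names.
sub decode_flags {
    my ($raw) = @_;

    # Known flags are always listed, even when they are not set.
    my %flags = map { $_ => 0 } values %name_of_bit;

    for my $bit (0 .. 8 * length($raw) - 1) {
        # vec() with a width of 1 counts bits from the lowest bit of byte 0.
        next unless vec($raw, $bit, 1);
        my $name = $name_of_bit{$bit} // "2^$bit";   # unknown bits keep a generic name
        $flags{$name} = 1;
    }
    return \%flags;
}

# Made-up 2-byte field with bits 6 and 15 set.
my $flags = decode_flags(pack 'H*', '4080');
print "$_ => $flags->{$_}\n" for sort keys %$flags;
```

With this sample input the result has exactly the three entries from the hash above: is_deleted => 0, is_compressed => 1 and '2^15' => 1.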

Your ideas?

Thanks for all your input! I have a bit of trouble deciding whether I should go with

%flags = (
    'is_deleted'    => 0,       # a known flag
    'is_compressed' => 1,       # another known flag
    '2^15'          => 1,       # a bit that is set but unknown
);

or with

  flags => 0b00100010000...
  #            | is_deleted
  #                | is_compressed

because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:

example:
  # name                specification       expected value
  - [ Flags,            'flags_example',                           ]

flags_example:
  {
    "length": 4,                # length of flags field (in bits or bytes)
    "2^2": "is_deleted",        # a known flag
    "2^6": "is_compressed",     # another known flag
    ...
  }
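Such a specification would also cover the way back (hash to bytes). A minimal sketch, assuming the spec has been loaded into a Perl hash and that "length" is given in bytes; encode_flags and %bit_of_name are names invented for this example.

```perl
use strict;
use warnings;

# Spec as it might look after loading the YAML; 'length' is assumed to be in bytes.
my %spec = ( length => 4, '2^2' => 'is_deleted', '2^6' => 'is_compressed' );

# Reverse mapping: flag name => bit position.
my %bit_of_name;
for my $key (keys %spec) {
    $bit_of_name{ $spec{$key} } = $1 if $key =~ /^2\^(\d+)$/;
}

# Turn a flags hash back into the raw byte string.
sub encode_flags {
    my ($flags) = @_;
    my $raw = "\0" x $spec{length};

    for my $name (grep { $flags->{$_} } keys %$flags) {
        # Known flags come from the spec, unknown ones carry their bit in the name.
        my $bit = $bit_of_name{$name} // ($name =~ /^2\^(\d+)$/ ? $1 : undef);
        die "unknown flag '$name'\n"      unless defined $bit;
        die "bit $bit outside of field\n" if $bit >= 8 * $spec{length};
        vec($raw, $bit, 1) = 1;
    }
    return $raw;
}

my $raw = encode_flags({ is_deleted => 0, is_compressed => 1, '2^15' => 1 });
print unpack('H*', $raw), "\n";    # 40800000
```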

context

The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats the parser is configurable by a YAML-file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:

hex dump:

4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00

annotated hex dump:

 4B 53 49 5A                           Type           (KSIZ)
 04 00                                 Size           (always 4)
 03 00 00 00                           KwrdCount
 4B 57 44 41                           Type           (KWDA)
 0C 00                                 Size           (4 * KwrdCount)
 98 37 01 00 95 37 01 00 6C 2A 09 00   Keywords       FormID{count}

parser grammar:

example:
  # name                specification       expected value
  - [ type1,            'char[4]',          'x = KSIZ'             ]
  - [ size1,            'uint[2]',          'x = 4', 's = 2'       ]
  - << size1 begin >>
  - [ count,            'uint[size1]',      'x > 0'                ]
  - << size1 end >>
  # -------------------------------------------------------------- #
  - [ type2,            'char[4]',          'x = KWDA'             ]
  - [ size2,            'uint[2]',                                 ]
  - << size2 begin >>
  - [ Keywords,         'uint[4]{count}',   'c > 0'                ]
  - << size2 end >>

combining hex dump + grammar results in:

...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
...

(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
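For comparison, this is roughly what the grammar boils down to when written out by hand with unpack. It is only a hard-coded sketch of this one record, not how the library works internally. In particular the count is read as a fixed uint32 here, whereas the grammar derives its width from size1.

```perl
use strict;
use warnings;

# The hex dump from above, converted back into a binary string.
my $hex = '4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 '
        . '98 37 01 00 95 37 01 00 6C 2A 09 00';
(my $packed = $hex) =~ s/\s+//g;
my $data = pack 'H*', $packed;

# char[4], uint[2], uint[4], char[4], uint[2] (all integers little-endian).
my ($type1, $size1, $count, $type2, $size2) = unpack 'a4 v V a4 v', $data;

# The 'expected value' column, checked by hand.
die "bad KSIZ part\n" unless $type1 eq 'KSIZ' && $size1 == 4 && $count > 0;
die "bad KWDA part\n" unless $type2 eq 'KWDA' && $size2 == 4 * $count;

# uint[4]{count}: skip the 16 header bytes, then read $count uint32 values.
my @keywords = unpack "x16 V$count", $data;

print join(', ', map { sprintf '0x%08X', $_ } @keywords), "\n";
# 0x00013798, 0x00013795, 0x00092A6C
```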

How to read the parser grammar:

This parser grammar is written in YAML. (Actually the only reason for YAML is the ability to use comments. Strip the comments and it's JSON.)
lines beginning with # are comments and are only provided for documentational purpose
lines beginning with - indicate a parseable item
Square-bracketed lines indicate a value to read from the data stream. The first column is a suitable value name, the second is the actual binary data format, and the third is optional and (if present) denotes one or more conditions that must be met in order for the value to be valid.
value names matching qr/^[a-z\d]+$/ are relevant only during parsing and are not part of the final result set. If the parsed data needs to be serialized into a data stream again, these values are either calculated from the input value (3 'Keywords' => count=3) or, if they are required to be a certain value, taken from the 'expected value' column (type1='KSIZ').
all other values are part of the returned parser result.
'<< (\w+) (begin|end) >>' signifies the begin and end for calculating the relevant size value. (nasty: The length of 'size' itself may be a part of the actual value.)
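The serialization direction described in the list can be sketched by hand for the example record: count and the second size are calculated from the Keywords list, while the type tags and the first size come from the 'expected value' column. Again this is a hard-coded illustration with the layout assumed from the annotated hex dump, not the library's code.

```perl
use strict;
use warnings;

# The only value that is part of the parsed result.
my @keywords = (0x00013798, 0x00013795, 0x00092A6C);

# Parse-only values are recomputed or taken from the grammar.
my $count = @keywords;        # calculated from the input value
my $size2 = 4 * $count;       # calculated: 4 bytes per keyword

my $data = pack('a4 v V',  'KSIZ', 4, $count)             # type1, size1, count
         . pack('a4 v V*', 'KWDA', $size2, @keywords);    # type2, size2, Keywords

# Render the bytes the same way as the hex dump for comparison.
my $dump = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $data;
print "$dump\n";
```

The printed line should be identical to the hex dump shown earlier.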

not shown: sub records, alternatives, repeating records, ...

I'm pretty sure this library will end up on CPAN some day; for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)

In reply to Looking for ideas: Converting a binary 'flags' field into something human readable by Monk::Thomas
