dHarry has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I seek wisdom for implementing a parser in Perl. I am more of an “end user” when it comes to parsers; I have limited experience in writing them.

I am thinking about implementing a parser for ODL (Object Description Language; no, not the Object Definition Language). ODL is endorsed by the PDS (Planetary Data System) and is used to describe/define data products, e.g. scientific measurements done by instruments. See below for an example; then it becomes clear what I would like to parse. This offers a pretty good and short explanation.

There is some software around…

But I am really looking for a Perl solution (yes I am biased).

A parser that can read the ODL format and generate events like “start_of_object X”, “keyword_Y”, “end_of_object Z”, triggering callback functions. A bit like SAX, but for ODL instead of XML, I suppose. Then I can build other things on top of that, like validating the data or generating other data products from it.
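To make that intent concrete, here is a rough sketch of such an event stream over a drastically simplified subset of ODL (the interface and names are purely illustrative, not from any existing module; it ignores comments, multi-line values, and everything else that makes real ODL hard):

```perl
use strict;
use warnings;

# Hypothetical sketch: fire SAX-like callbacks for each line of a
# simplified "KEY = value" / OBJECT / END_OBJECT stream.
sub parse_odl_events {
    my ($text, %cb) = @_;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;
        if    ($line =~ /^\s*OBJECT\s*=\s*(\w+)/)     { $cb{start_object}->($1) }
        elsif ($line =~ /^\s*END_OBJECT\s*=\s*(\w+)/) { $cb{end_object}->($1)   }
        elsif ($line =~ /^\s*(\w+)\s*=\s*(.+)$/)      { $cb{keyword}->($1, $2)  }
    }
}

my @events;
parse_odl_events(
    "OBJECT = TABLE\nROWS = 9387\nEND_OBJECT = TABLE\n",
    start_object => sub { push @events, "start_of_object $_[0]" },
    keyword      => sub { push @events, "keyword $_[0] = $_[1]" },
    end_object   => sub { push @events, "end_of_object $_[0]" },
);
print "$_\n" for @events;
```

Validation or data-product generation would then live entirely in the callbacks, as the post suggests.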

I have looked around on CPAN and the web and performed a Super Search…



Can anyone point me in the right direction, i.e. how to implement an (ODL) parser? Or is there already something around?

Example data file EXAMPLE.LBL

__DATA__

/*** FILE CHARACTERISTIC DATA ELEMENTS ***/
FILE_NAME    = "2007_11_11T10_11_47_1.DAT"
FILE_RECORDS = 9392
RECORD_TYPE  = FIXED_LENGTH
RECORD_BYTES = 4264
RELEASE_ID   = 0001
REVISION_ID  = 0001

/* Table object describing the data */
OBJECT               = TABLE
  COLUMNS            = 42
  INTERCHANGE_FORMAT = BINARY
  ROW_BYTES          = 4264
  ROWS               = 9387
  DESCRIPTION        = "BLABLA"

  OBJECT          = COLUMN
    COLUMN_NUMBER = 1
    NAME          = SPECTRUM
    BYTES         = 2048
    START_BYTE    = 1
    DATA_TYPE     = MSB_INTEGER
    DESCRIPTION   = "
      THIS COLUMN CONTAINS THE RAW DATA OF THE 512 CHANNEL
      X-RAY SPECTRUM OF THE INSTRUMENT.
    "
    ITEMS         = 512
    ITEM_BYTES    = 4
  END_OBJECT      = COLUMN

  /* more COLUMN objects here */
END_OBJECT           = TABLE
END

Re: Writing an ODL parser?
by moritz (Cardinal) on Jul 29, 2008 at 16:34 UTC

      First of all thanks for the advice!

      This looks like a classical application of a recursive descent parser, which I introduced in RFC: Parsing with perl - Regexes and beyond.

      I have difficulty understanding the article. In my humble opinion it is rather theoretical. Maybe I am not worthy ;-)

      In fact I think I wrote a similar parser for a SoPW the other day; see Design hints for a file processor. Since some of your tokens span multiple lines, you need a slightly more intelligent tokenizer though.

      The "slightly" is an understatement. Some of my nightmare examples to illustrate the point:

      In general: KEYWORD = value with value being of a certain type. The comments should be ignored and are not part of the value.

      /* A 2-dimensional sequence as the value is being called in ODL */ KEYWORD = ((1,2) (3,4) (5,8) /* some comment */ (9,11))

      /* A set as the value is being called in ODL */ KEYWORD = { RED, BLUE, /* some comment */ GREEN, HAZEL }

      /* A text string spanning multiple lines */ KEYWORD = "some text /* not a comment but part of the value! */ more text even more text" /* this is again a comment*/

        I have difficulty understanding the article. In my humble opinion it is rather theoretical. Maybe I am not worthy ;-)

        No, it probably means that you are not the intended target audience. Or that I did a bad job at writing.

        The "slightly" is an understatement. Some of my nightmare examples to illustrate the point:

        You're right, I underestimated the complexity. I thought you could just take lines, and multiple lines if they contained non-closed quoted strings.

        Still you should not give up hope. I wrote a simple lexer that works for the example you gave:

        use strict;
        use warnings;
        use Data::Dumper;
        use Math::Expression::Evaluator::Lexer qw(lex);

        my $d = do { local $/; <DATA> };

        my @tokens = (
            ['Comment',       qr{/\*.*?\*/}s, sub { return }],
            ['Identifier',    qr{[a-zA-Z_]\w+}],
            ['Number',        qr{\d+}],
            ['Operator',      qr{[=(),+-/*{}]}],
            ['Quoted String', qr{"[^"]*"}],
            ['Newline',       qr{\n}],
            ['Whitespace',    qr{\s+}, sub { return }],
        );

        print Dumper lex($d, \@tokens);

        __DATA__
        /* A 2-dimensional sequence as the value is being called in ODL */
        KEYWORD = ((1,2) (3,4) (5,8) /* some comment */ 9,11))

        /* A set as the value is being called in ODL */
        KEYWORD = { RED, BLUE, /* some comment */ GREEN, HAZEL }

        /* A text string spanning multiple lines */
        KEYWORD = "some text
        /* not a comment but part of the value! */
        more text
        even more text" /* this is again a comment*/

        This is far from ideal, but it does tokenize the data in a meaningful way, and strips comments, but not those inside quoted strings.

        (The lexer in Math::Expression::Evaluator::Lexer is quite simple and not iterator-like. If you don't want to read all input at once, you need to come up with something more sophisticated.)
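For reference, one way to get an iterator-style lexer in plain Perl (a sketch of the general technique, not tied to any existing module) is to anchor each token regex with \G and match with /gc, so tokens are pulled from the input on demand instead of all at once:

```perl
use strict;
use warnings;

# Sketch of an iterator-style lexer: make_lexer() returns a closure that
# yields one [name, text] token per call, advancing pos() through the
# input.  A rule's optional third element is a keep-filter; returning
# false skips the token (e.g. whitespace).
sub make_lexer {
    my ($text, @rules) = @_;
    return sub {
        TOKEN: {
            return if defined pos($text) && pos($text) >= length($text);
            for my $rule (@rules) {
                my ($name, $re, $keep) = @$rule;
                if ($text =~ /\G($re)/gc) {
                    redo TOKEN if $keep && !$keep->($1);   # skip filtered tokens
                    return [$name, $1];
                }
            }
            die "Lexing error near: '" . substr($text, pos($text) // 0, 20) . "'";
        }
    };
}

my $lexer = make_lexer(
    "KEYWORD = 42",
    ['Identifier', qr{[A-Za-z_]\w*}],
    ['Number',     qr{\d+}],
    ['Operator',   qr{=}],
    ['Whitespace', qr{\s+}, sub { 0 }],
);
while (my $tok = $lexer->()) {
    print "$tok->[0]: $tok->[1]\n";
}
```

Because a failed /gc match does not reset pos(), the rules can simply be tried in order at the current position.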

Re: Writing an ODL parser?
by jethro (Monsignor) on Jul 29, 2008 at 19:00 UTC
    Parse::RecDescent is not that hard to use. It also integrates event-style triggering of callback functions quite nicely.

    For example you would define a table object:

    tableobject : tablestart tabledef(s) tableend
                      { tablefunc($item[1]) }

    tablestart : 'OBJECT' '=' 'TABLE' "\n"
                      { tablestartfunc() }

    tabledef : columnnumber | name | bytes | startbyte
             | columnobject { columnstartfunc() }
             | rows

    columnnumber : 'COLUMN_NUMBER' '=' number /\n/
                      { $return = recordcolumn($item{number}); }

    number : /\d+/
    In this example tablestartfunc() would be called at the start of a table definition, tablefunc() after the whole table was parsed, and recordcolumn() would be called with the column number as its parameter.

    UPDATE:
    Whether you record the table information with the callbacks or by using the $return mechanism of Parse::RecDescent is your decision. In the former case tablefunc doesn't need the $item[1] parameter and columnnumber() doesn't need to set $return. In the latter case columnnumber would need a more elaborate return value, for example $return= ['columnnumber',$item{number}];

      Thanks for the example/explanation. I will spend some time on Parse::RecDescent to check if I can really use it.

      The example I gave was a bit simple; there are many different types of OBJECTs, and the values can be a bit tricky (see my response to the post of moritz above).

        That makes P::RD even better suited. The great advantage of P::RD is that you can recursively (top-down or bottom-up) specify what is expected. So in your example you just define that keywordata is either a twodimsequence, a set, or a text:

        keywordata: twodimsequence | set | text
        How these are defined in detail is something you can work out later, without the complexity of thinking about where they are used. Some time later you might define that a twodimsequence is a list of numberpairs surrounded by brackets:

        twodimsequence: '(' numberpair(s) ')'
        You see, the definition of the language to parse is really easy and straightforward. The difficult part is mostly what to do with the parsed data. If you already know what your callback routines have to do, the only problems left are in the details.

        One detail is how to cope with the comments inside your data files.

        One idea (that might also help you with moritz's and pc88mxer's solutions) is to do the parsing in two phases. In the first phase you would delete all comments and fold data spread over multiple lines into one line; in the second phase you would do the parsing. Note that you can let more than one parser run over some text with P::RD, so it is well suited for that decoupling of problems.
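As a sketch of that first phase (assuming, as in the examples above, that double-quoted strings cannot contain escaped quotes), comments outside strings can be stripped in a single substitution by matching the strings first and keeping them verbatim:

```perl
use strict;
use warnings;

# Phase-one sketch: remove /* ... */ comments, but leave comment-lookalikes
# inside double-quoted strings untouched.  The alternation tries the quoted
# string first; when it matches, $1 is defined and kept as-is.
sub strip_comments {
    my ($text) = @_;
    $text =~ s{
        ( " [^"]* " )     # a quoted string: keep
      | /\* .*? \*/       # a comment: drop
    }{defined $1 ? $1 : ''}gsex;
    return $text;
}

my $line = 'KEY = "a /* kept */ b" /* dropped */';
print strip_comments($line), "\n";
```

Folding multi-line values into one line would be a similar pre-pass before the real parse.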

        The other possibility is to define the SKIP expression in P::RD to be sometimes not only spaces but comments and newlines as well.

        Or you just put 'comment(?)' or 'comment(s?)' in all places where a comment can appear, and define what a comment looks like (obviously no callback needed there). That may clutter your definitions, but may be less tricky to get right than the SKIP method. The definition of a comment would probably look like this:

        comment: m{/\* ([^*]|\*[^/])* \*/ }x
        I don't say it is a walk in the park to get that parser running; the problems will be in the details. But you will be able to split the big problem into manageable parts.

Re: Writing an ODL parser?
by pc88mxer (Vicar) on Jul 29, 2008 at 19:22 UTC
    You might consider writing a parser which is just "good enough" for the files you want to process. Since structures seem to end on line boundaries, you can try something like this:
    use Data::Dumper;

    my $root = {};
    my @stack;
    my $object = $root;

    while (<DATA>) {
        if (m{^\s*/\*}) {                       # assume single-line comment
            next;
        }
        elsif (m/^\s*END(_OBJECT)?/) {
            $object = pop(@stack);
        }
        elsif (m/^\s*OBJECT\s*=\s*(\S+)/) {
            my $new = { parent => $object, type => $1 };
            push(@{$object->{children}}, $new);
            push(@stack, $object);
            $object = $new;
        }
        elsif (m/^\s*(\w+)\s*=\s*(\S*)/) {
            my ($property, $value) = ($1, $2);
            if ($value =~ m/^"/) {
                # keep reading lines until the quotes balance out
                while ((($value =~ tr/"//) % 2 != 0) && defined($_ = <DATA>)) {
                    s/^\s*//;
                    $value .= $_;
                }
            }
            $object->{$property} = $value;
        }
    }
    print Dumper($root);
    The string literal parsing is admittedly a little cheesy since I don't know what the rules are for representing literals in ODL files.

      Thanks for the effort!

      You might consider writing a parser which is just "good enough" for the files you want to process. Since structures seem to end on line boundaries, you can try something like this: ...

      Writing something just "good enough" was exactly my first approach :-) But now I want to improve it and make it more generic. As you can read in my reply to moritz, the values can be a bit tricky.