Mugatu has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to learn how to make a basic parser. I am trying to parse RTF documents. I am not trying to make anything fancy, I just want to tokenize the data and iterate on the tokens. Yes, I am aware of RTF::Parser, but the point is not to write something fully functional, but just to learn how to get a basic parser framework in place.

An RTF goes something like this:

{{\escape\sequences \more\sequences{\yet\more}\again\some\more\sequenc +es Some Data}{\foo\bar Some Other Data}}

Which, expanded to a more readable form, looks like this:

{ { \escape\sequences \more\sequences { \yet\more } \again\some\more\sequences Some Data } { Some Other Data \foo\bar } }

This may not be perfectly representative of the format, but it gives the general idea. There are groups of things denoted by curly brackets, which can be arbitrarily nested. There are escape sequences, which appear to be arbitrarily placed, and there's the actual data, which also seems to be arbitrarily placed.

My approach to the parser was to first find { and } characters, then get all the stuff between them, and split that up at whitespace boundaries. It seems to work, but feels kind of clunky. I don't like having the two nested loops, and I can forsee adding more to further subdivide the tokens. Here's the code:

use strict; use warnings; my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\agai +n\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}"; while ($text =~ /\G([{}])/g) { print "Token: '$1'\n"; if ($text =~ /\G([^{}]*)/g) { my $chunk = $1; while ($chunk =~ /(\S+|\s+)/g) { print "Token: '$1'\n"; } } }

So, I ask you, where did I go wrong? Is there a simpler way to iterate on the tokens that I'm just not thinking of? I sure hope so!

Replies are listed 'Best First'.
Re: Basics of parsing (using RTF as a testbed)
by ikegami (Patriarch) on Feb 25, 2005 at 19:12 UTC

    Here's a tokenizer for your language as you've defined it so far:

    my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\agai +n\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}"; printf("%-14s %s\n", 'Token Type', 'Token Value'); printf("%-14s %s\n", '='x14, '='x40); foreach ($text) { m/\G( { )/gcx && do { printf("%-14s %s\n", 'curly, opening' +, $1); redo; }; m/\G( } )/gcx && do { printf("%-14s %s\n", 'curly, closing' +, $1); redo; }; m/\G( \\\w+ )/gcx && do { printf("%-14s %s\n", 'escape', + $1); redo; }; m/\G( [^{}\\]+ )/gcx && do { printf("%-14s %s\n", 'text', "\" +$1\""); redo; }; }
    and I can forsee adding more to further subdivide the tokens.

    By definition, a token is something that can't be further subdivided.

      This is just what I was looking for. A basic technique that doesn't use the ugly nested (or, as Ovid says, "ad-hoc") tokenizing that mine has. Many thanks, and I wish I had the power to upvote you for that!

      Update: just so there are no misunderstandings, I appreciate all the other replies too. I will look at them and try to glean some useful information. But ikegami's reply was just the kind of basic jump-start I was looking for.

      Another update: I've moved the token names and patterns into an array that I loop on, rather than hard coding the logic. I also implemented a simple indentation scheme. Here it is, if anyone is interested:

      my @tokens = ( { name => 'group_begin', pattern => qr({) }, { name => 'group_end', pattern => qr(}) }, { name => 'escape', pattern => qr(\\[^\\{}\s]+) }, { name => 'text', pattern => qr([^\\{}]+) }, ); my $indent = 0; TOKENLOOP: { for (@tokens) { if ($text =~ /\G($_->{pattern})/gc) { (my $token = $1) =~ s/\n/[\\n]/g; $indent-- if $token eq "}"; print " " x $indent, "->$token<-\n"; $indent++ if $token eq "{"; redo TOKENLOOP; } } }
Re: Basics of parsing (using RTF as a testbed)
by sauoq (Abbot) on Feb 25, 2005 at 18:37 UTC

    Are you familiar with Text::Balanced? I think you'll find it useful...

    As for more general advice, you really should have a specification (rather than a "general idea") of the language you are trying to parse before trying to parse it, even if you just have to recreate it from the sample you've got. It gives you something definite to work with. Otherwise, every time you fix your code for a new example, you'll be unsure of whether it still works on previous examples. If you fix the spec first and then code to it, you won't have that issue.

    -sauoq
    "My two cents aren't worth a dime.";
    
      Are you familiar with Text::Balanced? I think you'll find it useful...

      Indeed it would be useful to me here, except I want to learn the techniques myself. I am not doing this for any purpose but to learn the mechanics of a decent parser (or I should say tokenizer, since that's all I'm really doing).

Re: Basics of parsing (using RTF as a testbed)
by Ovid (Cardinal) on Feb 25, 2005 at 18:48 UTC

    One way to learn parsing is to study a working example. First, you need to have a grammar to go by. Without a grammar, parsing becomes an ad-hoc affair. For one example of a parser, you can see AI::Prolog::Parser. That provides the parsing tools used in AI::Prolog::Term::_new_from_parser and AI::Prolog::TermList::_new_from_parser. The grammar is described in AI::Prolog::Builtins. By seeing how the parser provides the tools for parsing and how the Term and TermList modules use the parser to parse themselves, you'll have a fairly decent understanding of one way to parse. Note that my examples are ported from a Java app, so the parsing is much lower level than you are likely to experience in Perl.

    Admittedly, this parser is specifically for Prolog programs, but Prolog is easy enough to parse that it's a fairly simple example.

    Cheers,
    Ovid

    New address of my CGI Course.

Re: Basics of parsing (using RTF as a testbed)
by hardburn (Abbot) on Feb 25, 2005 at 18:48 UTC

    Recursion tends to be a much easier approach to parsing than iteration. An informal pseudo-code grammar might be:

    word character: any character matching /[A-Za-z]/ sequence: starts with '\', then one or more word characters section: one or more sequences, or something that starts with '{', then a section, then '}'

    Note the "then a section" part of 'section' is recursive. This allows sections to be embedded inside other sections. The above is quite a bit wordier than a formal grammar defintion would be (as pseduo-code often is), but should be more understandable if you haven't played with context-free grammars before.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: Basics of parsing (using RTF as a testbed)
by stvn (Monsignor) on Feb 25, 2005 at 20:09 UTC
    Mugatu

    It is helpful (at least it was for me) to think of a parser as a pipelined process.

    First you tokenize the string, once you have broken all your text into the right bits, you pass that to the lexical analyzer (aka - lexer).

    The lexer will then process the tokens and analyze them determining their "type". Bascially token1 is a string, token2 is an operator, token3 is a bracket, etc etc etc.

    Once you have a set of properly classified tokens, you can then build an abstract syntax tree to represent their structure. This results in what is commonly called a "parse tree". If you are familiar with the XML/HTML-DOM, those are basically parse tree's of the XML/HTML documents.

    At this point, you have your parse tree, and the parsing is completed. Now of course you need to figure out what to actually do with that parse tree :)

    Now, the process I described is not the only way to parse, many parser do all this in one step, or combine a a couple steps together (tokenizing and lexical analysis are commonly combined together). But breaking it down into these steps was what helped me to learn how to write parsers. Hope this helps.

    -stvn
Re: Basics of parsing (using RTF as a testbed)
by dragonchild (Archbishop) on Feb 25, 2005 at 18:52 UTC
    I would look at how tilly handles parsing in Text::xSV. It's very different from the Parse::RecDescent type of parsers and may not be applicable to what you're working on. But, it may be useful. :-)

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.