
I am trying to learn how to write a basic parser, using RTF documents as the test case. Nothing fancy: I just want to tokenize the data and iterate over the tokens. Yes, I am aware of RTF::Parser, but the point is not to write something fully functional, just to learn how to get a basic parser framework in place.

An RTF document looks something like this:

{{\escape\sequences \more\sequences{\yet\more}\again\some\more\sequences Some Data}{\foo\bar Some Other Data}}

Which, expanded to a more readable form, looks like this:

{
  {
    \escape\sequences
    \more\sequences
    {
      \yet\more
    }
    \again\some\more\sequences
    Some Data
  }
  {
    \foo\bar
    Some Other Data
  }
}

This may not be perfectly representative of the format, but it gives the general idea. There are groups of things denoted by curly brackets, which can be arbitrarily nested. There are escape sequences, which appear to be arbitrarily placed, and there's the actual data, which also seems to be arbitrarily placed.
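To make the "arbitrarily nested groups" part concrete, here is a minimal sketch that only looks at the curly brackets and reports how deep the nesting goes. The name max_depth is my own invention, and it assumes every { and } is a group delimiter; real RTF also allows escaped literal braces (\{ and \}), which this ignores.

```perl
use strict;
use warnings;

# Hypothetical helper: return the deepest nesting level of { } groups,
# dying if the braces do not balance. Assumes no escaped braces.
sub max_depth {
    my ($text) = @_;
    my ($depth, $max) = (0, 0);

    # In list context, /g returns every captured brace in order.
    for my $ch ($text =~ /([{}])/g) {
        if ($ch eq '{') {
            $depth++;
            $max = $depth if $depth > $max;
        }
        else {
            die "Unbalanced '}'\n" if --$depth < 0;
        }
    }
    die "Unclosed '{'\n" if $depth != 0;
    return $max;
}

my $sample = '{{\escape\sequences \more\sequences{\yet\more}\again\some\more\sequences Some Data}{\foo\bar Some Other Data}}';
print max_depth($sample), "\n";    # prints 3
```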

My approach to the parser was to first find { and } characters, then get all the stuff between them, and split that up at whitespace boundaries. It seems to work, but feels kind of clunky. I don't like having the two nested loops, and I can foresee adding more to further subdivide the tokens. Here's the code:

use strict;
use warnings;

my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\again\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}";

while ($text =~ /\G([{}])/g) {
    print "Token: '$1'\n";

    if ($text =~ /\G([^{}]*)/g) {
        my $chunk = $1;

        while ($chunk =~ /(\S+|\s+)/g) {
            print "Token: '$1'\n";
        }
    }
}

So, I ask you, where did I go wrong? Is there a simpler way to iterate on the tokens that I'm just not thinking of? I sure hope so!
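For comparison, the kind of single loop I have in mind would be something like the sketch below: one while loop that tries a handful of \G-anchored patterns at the current position, using /gc so a failed branch does not reset pos(). This is only a guess at a flatter structure, not a claim about how RTF should be tokenized. The tokenize name, the token type labels, and the rule that a backslash plus everything up to the next whitespace, brace, or backslash is one control token are all my own assumptions. Note that it splits \escape\sequences into two tokens, unlike the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into [TYPE, value] pairs in one pass.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;    # start scanning at the beginning

    while (pos($text) < length $text) {
        # /c keeps pos() where it is when a branch fails to match,
        # so the next branch tries again at the same position.
        if    ($text =~ /\G([{}])/gc)         { push @tokens, [ BRACE   => $1 ] }
        elsif ($text =~ /\G(\\[^\s{}\\]*)/gc) { push @tokens, [ CONTROL => $1 ] }
        elsif ($text =~ /\G(\s+)/gc)          { push @tokens, [ SPACE   => $1 ] }
        elsif ($text =~ /\G([^\s{}\\]+)/gc)   { push @tokens, [ TEXT    => $1 ] }
        else  { die "Cannot tokenize at offset " . pos($text) . "\n" }
    }
    return @tokens;
}

my $sample = '{{\escape\sequences \more\sequences{\yet\more}\again\some\more\sequences Some Data}{\foo\bar Some Other Data}}';
printf "%-7s '%s'\n", @$_ for tokenize($sample);
```

Adding another kind of token would then mean adding one more elsif branch rather than another nested loop.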

In reply to Basics of parsing (using RTF as a testbed) by Mugatu
