Re: Basics of parsing (using RTF as a testbed)

Here's a tokenizer for your language as you've defined it so far:

my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\again\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}";

printf("%-14s  %s\n", 'Token Type', 'Token Value');
printf("%-14s  %s\n", '='x14,       '='x40);

# \G anchors each match at the current pos(); /c keeps pos() when a match
# fails, so the next pattern is tried from the same spot. Braces are escaped
# so they are unambiguously literal on newer perls.
foreach ($text) {
   m/\G( \{       )/gcx && do { printf("%-14s  %s\n", 'curly, opening', $1);       redo; };
   m/\G( \}       )/gcx && do { printf("%-14s  %s\n", 'curly, closing', $1);       redo; };
   m/\G( \\\w+    )/gcx && do { printf("%-14s  %s\n", 'escape',         $1);       redo; };
   m/\G( [^{}\\]+ )/gcx && do { printf("%-14s  %s\n", 'text',           "\"$1\""); redo; };
}

"and I can foresee adding more to further subdivide the tokens."

By definition, a token is something that can't be further subdivided.
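
In practice that means any finer split belongs in the tokenizer itself, so that every token it emits is already atomic. Here is a rough sketch of the idea. It assumes a hypothetical extension that is not part of the language as defined so far: an escape may carry a trailing numeric parameter, as in \fs24. The sub name tokenize_with_params and the token names are mine, for illustration only.

```perl
use strict;
use warnings;

# Hypothetical extension: an escape may be followed by a numeric parameter.
# Rather than subdividing an 'escape' token later, emit two atomic tokens.
sub tokenize_with_params {
    my ($text) = @_;
    my @out;
    for ($text) {
        m/\G \{ /gcx && do { push @out, [ group_begin => '{' ]; redo; };
        m/\G \} /gcx && do { push @out, [ group_end   => '}' ]; redo; };
        m/\G \\ ([A-Za-z]+) (-?\d+)? /gcx && do {
            push @out, [ escape => $1 ];
            push @out, [ param  => $2 ] if defined $2;
            redo;
        };
        m/\G ( [^{}\\]+ ) /gcx && do { push @out, [ text => $1 ]; redo; };
    }
    return @out;    # list of [ type, value ] pairs
}

# Usage: print one token per line.
printf "%-12s %s\n", @$_ for tokenize_with_params('{\fs24 Hi}');
```

The parser downstream then never has to look inside a token's value to find structure.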


Replies are listed 'Best First'.

Re^2: Basics of parsing (using RTF as a testbed)
by Mugatu (Monk) on Feb 25, 2005 at 19:21 UTC

This is just what I was looking for. A basic technique that doesn't use the ugly nested (or, as Ovid says, "ad-hoc") tokenizing that mine has. Many thanks, and I wish I had the power to upvote you for that!

Update: just so there are no misunderstandings, I appreciate all the other replies too. I will look at them and try to glean some useful information. But ikegami's reply was just the kind of basic jump-start I was looking for.

Another update: I've moved the token names and patterns into an array that I loop over, rather than hard-coding the logic. I also implemented a simple indentation scheme. Here it is, if anyone is interested:

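# Assumes $text already holds the input string, as in the tokenizer above.
# The braces are escaped so they are unambiguously literal.
my @tokens = (
    { name => 'group_begin', pattern => qr(\{)           },
    { name => 'group_end',   pattern => qr(\})           },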
    { name => 'escape',      pattern => qr(\\[^\\{}\s]+) },
    { name => 'text',        pattern => qr([^\\{}]+)     },
);

my $indent = 0;

TOKENLOOP: {
    for (@tokens) {
        if ($text =~ /\G($_->{pattern})/gc) {
            (my $token = $1) =~ s/\n/[\\n]/g;
            $indent-- if $token eq "}";
            print "  " x $indent, "->$token<-\n";
            $indent++ if $token eq "{";
            redo TOKENLOOP;
        }
    }
}
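
For what it's worth, the same table can also drive a tokenizer that returns the tokens instead of printing them, which is usually what a parser wants to consume. This is only a sketch under my own assumptions: the sub name tokenize is hypothetical, and dying on unmatched input is a design choice of mine (the loop above simply stops when nothing matches).

```perl
use strict;
use warnings;

# Same token table as above, with the literal braces escaped.
my @tokens = (
    { name => 'group_begin', pattern => qr/\{/           },
    { name => 'group_end',   pattern => qr/\}/           },
    { name => 'escape',      pattern => qr/\\[^\\{}\s]+/ },
    { name => 'text',        pattern => qr/[^\\{}]+/     },
);

# Hypothetical helper: returns a list of [ name, value ] pairs.
sub tokenize {
    my ($text) = @_;
    my @found;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $t (@tokens) {
            # \G anchors at pos(); /c keeps pos() if this pattern fails.
            if ($text =~ /\G($t->{pattern})/gc) {
                push @found, [ $t->{name}, $1 ];
                next TOKEN;
            }
        }
        # No pattern matched here, e.g. a lone trailing backslash.
        die sprintf "No token matches at offset %d\n", pos($text);
    }
    return @found;
}

# Usage: print one token per line.
printf "%-12s %s\n", @$_ for tokenize('{\foo\bar Some Other Data}');
```

Failing loudly makes it obvious when the input contains something the table does not cover yet.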