in reply to Basics of parsing (using RTF as a testbed)

Here's a tokenizer for your language as you've defined it so far:

my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\again\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}";

printf("%-14s %s\n", 'Token Type', 'Token Value');
printf("%-14s %s\n", '='x14, '='x40);

# Each pattern is anchored with \G at the current match position.
# /c keeps that position when a match fails, so the next pattern
# is tried from the same spot; redo restarts the block after a hit.
foreach ($text) {
    m/\G( \{ )/gcx        && do { printf("%-14s %s\n", 'curly, opening', $1);   redo; };
    m/\G( \} )/gcx        && do { printf("%-14s %s\n", 'curly, closing', $1);   redo; };
    m/\G( \\\w+ )/gcx     && do { printf("%-14s %s\n", 'escape',         $1);   redo; };
    m/\G( [^{}\\]+ )/gcx  && do { printf("%-14s %s\n", 'text',       "\"$1\""); redo; };
}

"... and I can foresee adding more to further subdivide the tokens."

By definition, a token is something that can't be further subdivided.
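
If further subdivision is needed, that belongs in a later stage that works on the token stream rather than in the tokenizer. Here is a minimal sketch of the same tokenizer returning its tokens instead of printing them, so that a parser could consume them. The function name tokenize, the short type names, and the die on unrecognized input are my own assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into a flat list of [type, value] pairs.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;    # start matching at the beginning of the string

    while (pos($text) < length $text) {
        # \G anchors each attempt at the current position; /c keeps
        # that position when an attempt fails.
        if    ($text =~ /\G(\{)/gc)       { push @tokens, ['open',   $1] }
        elsif ($text =~ /\G(\})/gc)       { push @tokens, ['close',  $1] }
        elsif ($text =~ /\G(\\\w+)/gc)    { push @tokens, ['escape', $1] }
        elsif ($text =~ /\G([^{}\\]+)/gc) { push @tokens, ['text',   $1] }
        else {
            # Assumption: stop loudly instead of silently ending the loop.
            die "Unrecognized input at offset " . pos($text) . "\n";
        }
    }
    return @tokens;
}

# Usage: print each token as type(value).
my @tokens = tokenize('{\\b bold {\\i both} text}');
print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
```

Each element of the returned list is one indivisible unit. Nesting, grouping and the meaning of a particular escape are then decided by whatever walks that list.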

Re^2: Basics of parsing (using RTF as a testbed)
by Mugatu (Monk) on Feb 25, 2005 at 19:21 UTC

    This is just what I was looking for. A basic technique that doesn't use the ugly nested (or, as Ovid says, "ad-hoc") tokenizing that mine has. Many thanks, and I wish I had the power to upvote you for that!

    Update: just so there are no misunderstandings, I appreciate all the other replies too. I will look at them and try to glean some useful information. But ikegami's reply was just the kind of basic jump-start I was looking for.

    Another update: I've moved the token names and patterns into an array that I loop on, rather than hard coding the logic. I also implemented a simple indentation scheme. Here it is, if anyone is interested:

    my @tokens = (
        { name => 'group_begin', pattern => qr(\{) },
        { name => 'group_end',   pattern => qr(\}) },
        { name => 'escape',      pattern => qr(\\[^\\{}\s]+) },
        { name => 'text',        pattern => qr([^\\{}]+) },
    );

    my $indent = 0;

    # Try each pattern at the current position; on a match, print the
    # token and restart the block so the patterns are tried in order again.
    TOKENLOOP: {
        for (@tokens) {
            if ($text =~ /\G($_->{pattern})/gc) {
                (my $token = $1) =~ s/\n/[\\n]/g;
                $indent-- if $token eq "}";
                print " " x $indent, "->$token<-\n";
                $indent++ if $token eq "{";
                redo TOKENLOOP;
            }
        }
    }