Here's a tokenizer for your language as you've defined it so far:
my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\again\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}";

printf("%-14s %s\n", 'Token Type', 'Token Value');
printf("%-14s %s\n", '='x14, '='x40);

# foreach aliases $_ to $text; /gc keeps pos() in place when a match
# fails, so each redo resumes scanning where the last token ended.
# The braces are escaped because newer perls can reject an unescaped
# literal "{" in a pattern.
foreach ($text) {
    m/\G( \{ )/gcx        && do { printf("%-14s %s\n", 'curly, opening', $1); redo; };
    m/\G( \} )/gcx        && do { printf("%-14s %s\n", 'curly, closing', $1); redo; };
    m/\G( \\\w+ )/gcx     && do { printf("%-14s %s\n", 'escape', $1); redo; };
    m/\G( [^{}\\]+ )/gcx  && do { printf("%-14s %s\n", 'text', "\"$1\""); redo; };
}
and I can foresee adding more to further subdivide the tokens.
By definition, a token is something that can't be further subdivided.
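To make that concrete, here is a minimal sketch of the tokenizer above reworked to return its tokens instead of printing them, so a later parsing stage can treat each one as an atomic unit. The name tokenize and the type labels 'open', 'close', 'escape' and 'text' are my own choices for illustration, not something from the thread. Unlike the loop above, which stops silently on input it can't match, this version dies.

```perl
use strict;
use warnings;

# Hypothetical helper: the same four rules as the tokenizer above, but
# each token is collected as a [type, value] pair.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        if    ($text =~ m/\G ( \{ )       /gcx) { push @tokens, [ 'open',   $1 ] }
        elsif ($text =~ m/\G ( \} )       /gcx) { push @tokens, [ 'close',  $1 ] }
        elsif ($text =~ m/\G ( \\\w+ )    /gcx) { push @tokens, [ 'escape', $1 ] }
        elsif ($text =~ m/\G ( [^{}\\]+ ) /gcx) { push @tokens, [ 'text',   $1 ] }
        else {
            # No rule matched (e.g. a backslash not followed by a word
            # character). This is a design choice of the sketch.
            die sprintf("Unexpected input at offset %d\n", pos($text));
        }
    }
    return @tokens;
}

# Usage: print one token per line, type first.
my @tokens = tokenize('{\\foo bar}');
printf("%-7s %s\n", @$_) for @tokens;
```

The sample string yields four tokens: an opening curly, the escape \foo, the text " bar" and a closing curly. Anything that groups those tokens, such as matching the curlies, belongs to the parser rather than the tokenizer.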
In reply to Re: Basics of parsing (using RTF as a testbed) by ikegami, in thread Basics of parsing (using RTF as a testbed) by Mugatu