I am trying to learn how to make a basic parser, using RTF documents as the test case. I am not trying to make anything fancy; I just want to tokenize the data and iterate on the tokens. Yes, I am aware of RTF::Parser, but the point is not to write something fully functional, only to learn how to get a basic parser framework in place.
An RTF document looks something like this:
{{\escape\sequences \more\sequences{\yet\more}\again\some\more\sequences Some Data}{\foo\bar Some Other Data}}
Which, expanded to a more readable form, looks like this:
{ { \escape\sequences \more\sequences { \yet\more } \again\some\more\sequences Some Data } { \foo\bar Some Other Data } }
This may not be perfectly representative of the format, but it gives the general idea. There are groups of things denoted by curly brackets, which can be arbitrarily nested. There are escape sequences, which appear to be arbitrarily placed, and there's the actual data, which also seems to be arbitrarily placed.
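To make the nesting concrete, here is a minimal sketch that only looks at the curly brackets and reports how deep the groups go. The name max_group_depth is made up for illustration, and the sketch assumes that every { and } is a group delimiter (it does not handle escaped braces such as \{).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the deepest nesting level of { } groups
# and dies on unbalanced braces. Assumes no escaped braces in the input.
sub max_group_depth {
    my ($rtf) = @_;
    my ($depth, $max) = (0, 0);
    for my $brace ($rtf =~ /([{}])/g) {
        if ($brace eq '{') {
            $depth++;
            $max = $depth if $depth > $max;
        }
        else {
            die "Unbalanced '}'\n" if --$depth < 0;
        }
    }
    die "Unclosed '{'\n" if $depth != 0;
    return $max;
}

print max_group_depth('{{\a\b{\c}\d text}{\e more}}'), "\n";
```

For that sample string the depth counter peaks at the innermost {\c} group.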
My approach to the parser was to first find { and } characters, then get all the stuff between them, and split that up at whitespace boundaries. It seems to work, but feels kind of clunky. I don't like having the two nested loops, and I can foresee adding more to further subdivide the tokens. Here's the code:
use strict;
use warnings;

my $text = "{{\\escape\\sequences \\more\\sequences{\\yet\\more}\\again\\some\\more\\sequences Some Data}{\\foo\\bar Some Other Data}}";

while ($text =~ /\G([{}])/g) {
    print "Token: '$1'\n";
    if ($text =~ /\G([^{}]*)/g) {
        my $chunk = $1;
        while ($chunk =~ /(\S+|\s+)/g) {
            print "Token: '$1'\n";
        }
    }
}
So, I ask you, where did I go wrong? Is there a simpler way to iterate on the tokens that I'm just not thinking of? I sure hope so!
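For what it's worth, one simpler shape might be a single loop that tries each token pattern in turn at the current position, using \G with the /gc modifiers so that a failed match leaves the position alone. This is only a sketch under assumptions: tokenize and the type names (brace, control, space, text) are made up for illustration, and the patterns cover only the simplified format shown above, not real RTF control symbols or numeric parameters.

```perl
use strict;
use warnings;

# Hypothetical helper: splits RTF-like text into [type, value] pairs in one pass.
# Simplified on purpose: a backslash not followed by letters makes it die.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        # /c keeps pos() where it was when a pattern fails, so the
        # next branch gets to try at the same offset.
        if    ($text =~ /\G([{}])/gc)        { push @tokens, ['brace',   $1] }
        elsif ($text =~ /\G(\\[A-Za-z]+)/gc) { push @tokens, ['control', $1] }
        elsif ($text =~ /\G(\s+)/gc)         { push @tokens, ['space',   $1] }
        elsif ($text =~ /\G([^{}\\\s]+)/gc)  { push @tokens, ['text',    $1] }
        else  { die "Unexpected character at offset " . pos($text) . "\n" }
    }
    return @tokens;
}

# Usage: print each token with its type.
for my $tok (tokenize('{\foo\bar Some Data}')) {
    printf "%-7s '%s'\n", @$tok;
}
```

Adding a new kind of token would then mean adding one more elsif branch rather than another nested loop, and tagging each token with a type should make later iteration easier than working with bare strings.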
In reply to Basics of parsing (using RTF as a testbed) by Mugatu