This is just what I was looking for. A basic technique that doesn't use the ugly nested (or, as Ovid says, "ad-hoc") tokenizing that mine has. Many thanks, and I wish I had the power to upvote you for that!
Update: just so there are no misunderstandings, I appreciate all the other replies too. I will look at them and try to glean some useful information. But ikegami's reply was just the kind of basic jump-start I was looking for.
Another update: I've moved the token names and patterns into an array that I loop over, rather than hard-coding the logic. I also implemented a simple indentation scheme. Here it is, if anyone is interested:
    my @tokens = (
        { name => 'group_begin', pattern => qr(\{) },
        { name => 'group_end',   pattern => qr(\}) },
        { name => 'escape',      pattern => qr(\\[^\\{}\s]+) },
        { name => 'text',        pattern => qr([^\\{}]+) },
    );

    my $indent = 0;

    # $text holds the RTF source to tokenize.
    TOKENLOOP: {
        for (@tokens) {
            if ($text =~ /\G($_->{pattern})/gc) {
                # Show embedded newlines as a visible [\n] marker.
                (my $token = $1) =~ s/\n/[\\n]/g;
                $indent-- if $token eq "}";
                print " " x $indent, "->$token<-\n";
                $indent++ if $token eq "{";
                redo TOKENLOOP;
            }
        }
    }
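For anyone who wants to try the same table-driven idea in isolation, here is a minimal self-contained sketch. The tokenize() sub, the [name, value] return shape, and the sample input are my own assumptions for illustration, not part of the code above. One difference worth noting: the loop above simply stops when no pattern matches (for example at a backslash followed by a brace), whereas this sketch dies with the offset so the gap is visible.

```perl
use strict;
use warnings;

# Token table; the first pattern that matches at the current position wins.
my @tokens = (
    { name => 'group_begin', pattern => qr/\{/ },
    { name => 'group_end',   pattern => qr/\}/ },
    { name => 'escape',      pattern => qr/\\[^\\{}\s]+/ },
    { name => 'text',        pattern => qr/[^\\{}]+/ },
);

# tokenize() is a hypothetical helper: it returns a list of
# [ token_name, matched_text ] pairs for the given string.
sub tokenize {
    my ($text) = @_;
    my @out;
    pos($text) = 0;
    TOKEN: while (pos($text) < length $text) {
        for my $t (@tokens) {
            # \G anchors at pos(); /c keeps pos() when a match fails.
            if ($text =~ /\G($t->{pattern})/gc) {
                push @out, [ $t->{name}, $1 ];
                next TOKEN;
            }
        }
        # No pattern matched here: report it instead of stopping silently.
        die sprintf "No token matches at offset %d\n", pos($text);
    }
    return @out;
}

# Sample input (assumed): single quotes keep the backslash literal.
my @result = tokenize('{\b bold} plain');
printf "%-12s %s\n", @$_ for @result;
```

Returning the token names along with the text keeps the printing and indentation logic separate from the matching, so the same table could feed a real parser later.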
In reply to Re^2: Basics of parsing (using RTF as a testbed) by Mugatu, in thread Basics of parsing (using RTF as a testbed) by Mugatu