Re: parsing question

sliles has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Re: parsing question by mstone (Deacon) on Jan 08, 2002 at 01:14 UTC
The general class of problem you're trying to solve is called 'parenthesis balancing', and you can't do it with regular expressions.. they don't have the computing power. Every regular expression defines a set of strings that match the expression. That set is called a 'regular set', and the most convenient way to describe a regular set is to use the corresponding regular expression. Computationally, you can find every member of a regular set using a state machine, also known as a 'finite automaton'. Therefore, regular expressions, regular sets, and finite automata get lumped together as different ways of looking at the same problem. Balanced parentheses are more one step more complicated than a regular set. They're called a 'context-free grammar', or CFG. Again, the grammar defines a set of strings, this time called a 'context-free language' (CFL), and there's a certain type of machine that can find every string in a CFL: a pushdown automaton (PA). There are two general ways to program a PA: recursive functions, or a while() loop with a stack. As far as computing power is concerned, they're identical. For the problem at hand, you'll want a two-layer program consisting of a lexer (a state machine) and a parser (a pushdown automaton). The lexer breaks the input stream down into chunks called 'tokens', then the parser shuffles those tokens around to build a data structure called a 'syntax tree'. Then you walk through the syntax tree to produce output. That's all doable, but it's more complicated than just using a regular expression. Unfortunately, that's as simple as it gets.	[reply]
Re (tilly) 3: parsing question by tilly (Archbishop) on Jan 08, 2002 at 02:26 UTC
What you say is not true in Perl from version 5.6 onwards. The RE engine does have the power to solve this problem. Of course the "Regular" in RE has been a misnomer for ages, but that is another story.	[reply]
Re: Re (tilly) 3: parsing question by mstone (Deacon) on Jan 08, 2002 at 05:43 UTC
> The RE engine does have the power to solve this problem. Hrm.. please demonstrate. I know you can build a lexer by looping over `m/\G$regexp/g`, but that doesn't give you the state storage neccessary to balance parens. If you could whip up something to convert the second-level parens in: `(((())()))(()(()()))` [download] to square brackets: `([(())()])([][()()])` [download] using only regexps.. no variables, no recursion (and for arbitrary strings of nested parens, of course).. I'd love to add the technique to my bag of tricks.	[reply] [d/l] [select]
Re (tilly) 5: parsing question by tilly (Archbishop) on Jan 08, 2002 at 06:23 UTC
(MeowChow) Re4: parsing question by MeowChow (Vicar) on Jan 08, 2002 at 06:02 UTC
Re: (MeowChow) Re4: parsing question by mstone (Deacon) on Jan 09, 2002 at 01:04 UTC
Re: Re: Re: parsing question by mstone (Deacon) on Jan 09, 2002 at 00:20 UTC
As a follow-up, here's a toy compiler that handles balanced parentheses, and demonstrates the basic concepts of lexing input into tokens, parsing the tokens into a syntax tree, and walking the syntax tree to produce output. The lexer uses regular expressions, the parser uses a while() loop and a stack, and the output generator uses recursive functions. I also threw in the ability to handle literals, to make the problem a little more like the original query. -- edit: moved the code below the fold. Read more... (2 kB)	[reply] [d/l]
Re(2): parsing question by dmmiller2k (Chaplain) on Jan 08, 2002 at 01:02 UTC
I'm not sure I'd want to approach this with just a regex matching loop. This seems like it could quickly get out of hand, depending upon how nested the constructs in your "text" files can get. Perhaps you might take a look at using Parse::RecDecent to turn the input file into a parse tree and then re-encode THAT with slightly different rules (i.e., don't use curly braces around a block where there is only a single enclosed statement). dmm	[reply]
Use Parse::RecDescent by johanvdb (Beadle) on Jan 08, 2002 at 15:36 UTC
Hi, I would recomment on using the recursive parser toolkit from Damian Conway. Although you should be able to do it with some regexes and a stack-machine also. A single RE will not work though, as this problem is non trivial. I read in one of the replies that you'll need a lexer and a parser ( execellent reply! ) but they can get very simple. The lexer can be a simple list of regexes and the parser can be a simple array in which you pop, push stuff ( stack ). A good pointer to solve this problem can be: http://www.yapc.org/Europe/2001/proceedings/68/index.html and http://www.samag.com/documents/s=1288/sam03040010/ Hope this helps a bit ... `J.`	[reply] [d/l]
Re: Re: parsing question by thunders (Priest) on Jan 08, 2002 at 05:38 UTC
It's very difficult to do recursive substitutions with a regex alone, tilly says it's possible so i'll take his word for it. But modules like Text::Balanced and Parse::RecDescent are a godsend for this type of thing.	[reply]
Re(2): Re: parsing question by FoxtrotUniform (Prior) on Jan 08, 2002 at 21:46 UTC
I think it's worth repeating what tilly and mstone have mentioned; that is, matching balanced delimiters cannot be done with a (formal) regular expression. Perl's "regexes", however, are more powerful than (formal) regular expressions, and (with some contortions) can match balanced delimiters. For the sake of whoever maintains your code, though, use Text::Balanced instead, and keep the gnarly regexes in reserve for when they're the best tool for the job. `-- :wq`	[reply]