in reply to Re: parsing question

The general class of problem you're trying to solve is called 'parenthesis balancing', and you can't do it with regular expressions.. they don't have the computing power.

Every regular expression defines a set of strings that match the expression. That set is called a 'regular set', and the most convenient way to describe a regular set is to use the corresponding regular expression. Computationally, you can find every member of a regular set using a state machine, also known as a 'finite automaton'. Therefore, regular expressions, regular sets, and finite automata get lumped together as different ways of looking at the same problem.

Balanced parentheses are more one step more complicated than a regular set. They're called a 'context-free grammar', or CFG. Again, the grammar defines a set of strings, this time called a 'context-free language' (CFL), and there's a certain type of machine that can find every string in a CFL: a pushdown automaton (PA).

There are two general ways to program a PA: recursive functions, or a while() loop with a stack. As far as computing power is concerned, they're identical.

For the problem at hand, you'll want a two-layer program consisting of a lexer (a state machine) and a parser (a pushdown automaton). The lexer breaks the input stream down into chunks called 'tokens', then the parser shuffles those tokens around to build a data structure called a 'syntax tree'. Then you walk through the syntax tree to produce output.

That's all doable, but it's more complicated than just using a regular expression. Unfortunately, that's as simple as it gets.

Replies are listed 'Best First'.
Re (tilly) 3: parsing question
by tilly (Archbishop) on Jan 08, 2002 at 02:26 UTC
    What you say is not true in Perl from version 5.6 onwards. The RE engine does have the power to solve this problem.

    Of course the "Regular" in RE has been a misnomer for ages, but that is another story.


      > The RE engine does have the power to solve this problem.

      Hrm.. please demonstrate. I know you can build a lexer by looping over m/\G$regexp/g, but that doesn't give you the state storage neccessary to balance parens. If you could whip up something to convert the second-level parens in:

      (((())()))(()(()()))

      to square brackets:

      ([(())()])([][()()])

      using only regexps.. no variables, no recursion (and for arbitrary strings of nested parens, of course).. I'd love to add the technique to my bag of tricks.

        You just changed the problem massively. I didn't promise that the RE engine magically has become able to handle arbitrary parsing problems. I never said that no variables were involved. I merely said that the RE engine can handle balanced delimiters.

        The two experimental features which are needed are (??{}) for delayed evaluation and (?>) for telling the engine not to backtrack. A sample script demonstrating the technique:

        #! /usr/bin/perl my $braces; $braces = qr/(\((?:(?>[^\(\)]+)|(??{$braces}))*\))/; while (<>) { if ($_ =~ $braces) { print "Matched '$1'\n\n"; } else { print "No match\n\n"; } }
        Run it and start typing in lines. The ones with balanced parentheses will match.
        One example of this sort of thing is here. Look up perlre, and search the text for "postponed".
           MeowChow                                   
                       s aamecha.s a..a\u$&owag.print
Re: Re: Re: parsing question
by mstone (Deacon) on Jan 09, 2002 at 00:20 UTC

    As a follow-up, here's a toy compiler that handles balanced parentheses, and demonstrates the basic concepts of lexing input into tokens, parsing the tokens into a syntax tree, and walking the syntax tree to produce output. The lexer uses regular expressions, the parser uses a while() loop and a stack, and the output generator uses recursive functions. I also threw in the ability to handle literals, to make the problem a little more like the original query.

    -- edit: moved the code below the fold.