sliles has asked for the wisdom of the Perl Monks concerning the following question:

I have the following text file that I want to modify using Perl (this is just a text file I am reading in): ------------------------------------------------------
select (name) { //want to delete this { brace case(a=b): { // delete this { brace if (d=e) { // keep this { brace for if loop call pgmA; return rc; } // keep this } brace for if loop } // delete this } brace default: { // delete this { brace call pgmB; } // delete this } brace } // change the last } brace to END;
I want to delete some of the left curly braces (but not all) and change the last closing brace } to an "END;" so that the modified text looks like:
select (name) case(a=b): if (d=e) { ----->keep this opening brace in call pgmA; if loop return rc; } ----->keep this closing brace default: call pgmB; END; ----->change last } to END;
Can anyone suggest a solution? I know how to search for patterns, but I cannot figure out how to change the occurrence of some characters but not to change other occurrences of it. Thanks for your time, Susan

Replies are listed 'Best First'.
Re: Re: parsing question
by mstone (Deacon) on Jan 08, 2002 at 01:14 UTC

    The general class of problem you're trying to solve is called 'parenthesis balancing', and you can't do it with regular expressions.. they don't have the computing power.

    Every regular expression defines a set of strings that match the expression. That set is called a 'regular set', and the most convenient way to describe a regular set is to use the corresponding regular expression. Computationally, you can find every member of a regular set using a state machine, also known as a 'finite automaton'. Therefore, regular expressions, regular sets, and finite automata get lumped together as different ways of looking at the same problem.

    Balanced parentheses are more one step more complicated than a regular set. They're called a 'context-free grammar', or CFG. Again, the grammar defines a set of strings, this time called a 'context-free language' (CFL), and there's a certain type of machine that can find every string in a CFL: a pushdown automaton (PA).

    There are two general ways to program a PA: recursive functions, or a while() loop with a stack. As far as computing power is concerned, they're identical.

    For the problem at hand, you'll want a two-layer program consisting of a lexer (a state machine) and a parser (a pushdown automaton). The lexer breaks the input stream down into chunks called 'tokens', then the parser shuffles those tokens around to build a data structure called a 'syntax tree'. Then you walk through the syntax tree to produce output.

    That's all doable, but it's more complicated than just using a regular expression. Unfortunately, that's as simple as it gets.

      What you say is not true in Perl from version 5.6 onwards. The RE engine does have the power to solve this problem.

      Of course the "Regular" in RE has been a misnomer for ages, but that is another story.


        > The RE engine does have the power to solve this problem.

        Hrm.. please demonstrate. I know you can build a lexer by looping over m/\G$regexp/g, but that doesn't give you the state storage neccessary to balance parens. If you could whip up something to convert the second-level parens in:

        (((())()))(()(()()))

        to square brackets:

        ([(())()])([][()()])

        using only regexps.. no variables, no recursion (and for arbitrary strings of nested parens, of course).. I'd love to add the technique to my bag of tricks.

      As a follow-up, here's a toy compiler that handles balanced parentheses, and demonstrates the basic concepts of lexing input into tokens, parsing the tokens into a syntax tree, and walking the syntax tree to produce output. The lexer uses regular expressions, the parser uses a while() loop and a stack, and the output generator uses recursive functions. I also threw in the ability to handle literals, to make the problem a little more like the original query.

      -- edit: moved the code below the fold.

Re(2): parsing question
by dmmiller2k (Chaplain) on Jan 08, 2002 at 01:02 UTC

    I'm not sure I'd want to approach this with just a regex matching loop.

    This seems like it could quickly get out of hand, depending upon how nested the constructs in your "text" files can get. Perhaps you might take a look at using Parse::RecDecent to turn the input file into a parse tree and then re-encode THAT with slightly different rules (i.e., don't use curly braces around a block where there is only a single enclosed statement).

    dmm

Use Parse::RecDescent
by johanvdb (Beadle) on Jan 08, 2002 at 15:36 UTC
    Hi, I would recomment on using the recursive parser toolkit from Damian Conway. Although you should be able to do it with some regexes and a stack-machine also. A single RE will not work though, as this problem is non trivial. I read in one of the replies that you'll need a lexer and a parser ( execellent reply! ) but they can get very simple. The lexer can be a simple list of regexes and the parser can be a simple array in which you pop, push stuff ( stack ). A good pointer to solve this problem can be:
    http://www.yapc.org/Europe/2001/proceedings/68/index.html
    and
    http://www.samag.com/documents/s=1288/sam03040010/

    Hope this helps a bit ...

    J.
Re: Re: parsing question
by thunders (Priest) on Jan 08, 2002 at 05:38 UTC
    It's very difficult to do recursive substitutions with a regex alone, tilly says it's possible so i'll take his word for it. But modules like Text::Balanced and Parse::RecDescent are a godsend for this type of thing.

      I think it's worth repeating what tilly and mstone have mentioned; that is, matching balanced delimiters cannot be done with a (formal) regular expression. Perl's "regexes", however, are more powerful than (formal) regular expressions, and (with some contortions) can match balanced delimiters.

      For the sake of whoever maintains your code, though, use Text::Balanced instead, and keep the gnarly regexes in reserve for when they're the best tool for the job.

      --
      :wq