in reply to Balancing braces without (??{})

See Parse::YAPP. This implements a full grammar, which is a superset of what can be done with a regex.

Update: I just ran into Text::Balanced, which I suppose does exactly this. It's part of the Parse::RecDescent package. You should already have it if you've installed Switch.

A true regular expression is represented by a connected graph of nodes, and you move from state to state with each input token. It can't count, because it has no memory at all other than knowing what state it's in.

To count braces, you need a "Push-down automata", the next step up. It has a graph and a stack.

So, you could implement a P.D.A. by starting with foreach $char (split ('',$input)) {, switch to the current state, and push/pop a @array for the state. More specifically, push the open symbol so you can check against it for a matching close.

if $char is an open symbol, push it along with position. else if $char is a close symbol, pop the stack and verify that it's the matching type (error if not). popped position through current position is ballanced range. Save +that on answer list.
Error if stack underflows. When out of input, make sure stack is also empty else warning. Return list of ballanced position ranges.

—John

Replies are listed 'Best First'.
Re: Re: Balancing braces without (??{})
by tshabet (Beadle) on Jul 28, 2001 at 01:04 UTC
    Ah, the Push down automata, I knew it would return to haunt me someday. Applied theory with Professor Decker...how I miss it q-; Thanks for this very helpful info John. Actually, my previous solution was on a similar track but instead of implementing a stack to push and pop I was doing something more hackish and just using a variable that would increment for each { and decrement for each } and return the brace that brought the var back to zero. This was a very hackish approach to the problem, and your stack suggestion has obvious advantages aside from its more poetic approach. I'll take a look at Parse::YAPP, sounds like a useful read. Aside from the fact that (??{}) sequence makes the regex code more compact than implementing a stack, are there advantages or disadvantages in terms of runtime, memory, etc? I guess what I'm asking is, if you had perl 5.7 running on your machine and a rainy afternoon to spend writing the best code you could, which technique would you choose? I am very interested in making my code the best it can be, since I hope others will find it useful and want to use it as well. Gotta work on my skills before that hallowed day, though.
      I'd worry about the documentation that states, “WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice.”

      So, it might not work at all on Perl 5.8 when it comes out.

      If I had to write a good efficient implementation without using ??{}, I'd probably take what I showed you and do it in chunks rather than one char at a time. That is, chop the string into chunks where each chunk is a single ballancing char, or the whole run between them. Then perform the algorithm on that list, rather than every single char.

      Perhaps there should be a feature that elegantly does P.D.A., not just regexp, without getting into all the complexity of full LALR grammar. Perhaps stacking could be added as regex extenstions without generalized code blocks: a way to push something, and zero-width assertions and subexpressions that can refer to the stack top, and pop.

      —John

      Thinking about it again, I'd make a slight change. Instead of chopping it up first, I'd use a search for brace characters with /g as the condition of a while loop, so the body can examine pos() and push and pop and stuff, and all the uninteresting characters are skipped over.