Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Parsing nested parentheses

by JeffR100 (Acolyte)
on Nov 18, 2003 at 17:04 UTC ( [id://308039]=perlquestion: print w/replies, xml ) Need Help??

JeffR100 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find a way to use a regular expression to sort through nested parentheses and return everything within them surroundign a perticular element. Basically, I want to extract everything within the minimum matching parentheses surrounding a particular element. For example, I want to be able to input:

((A,B),C,(D,E)) and the letter C and have the program the program return (A,B),C,(D,E)), but if I enter D, I only want (D,E) returned. Is there a simple way to do this? I have looked at the CPAN modules for matching nested parentheses, but none of them are able to look for a specific string within the parentheses.

Thank You for your help.

Jeff

Replies are listed 'Best First'.
Re: Parsing nested parentheses
by Abigail-II (Bishop) on Nov 18, 2003 at 19:55 UTC
    Once you know how to write a regex to match a string with balanced parens, it's not to hard to write a regex to do what you want.

    How does a string with balanced parens look like? It starts with an opening paren, then "junk" separated by strings with balanced parens, where "junk" is a string consisting of something that isn't a paren. Quite like a delimited string with escaped delimiters actually.

    Now, how do we go from string with balanced parens to the requested string? Simple. We start with an opening paren, then junk string separated by strings with balanced parens, then "C", then junk strings separated by strings with balanced parens, and finally a closing paren.

    So much for text. Here's the code:

    #!/usr/bin/perl use strict; use warnings; no warnings qw /syntax/; my $bal; $bal = qr /[(] [^()]* (?:(??{ $bal }) [^()]* )* [)]/x; my $re = qr /[(] [^()]* (?:(??{ $bal }) [^()]* )* C [^()]* (?:(??{ $bal }) [^()]* )* [)]/x; while (<DATA>) { chomp; print "$_: "; print $& if /$re/; print "\n"; } __DATA__ ((A, B), C, (D, E)) (((A, B), C, (D, E))) ((A, B), D, (C, E)) ((A, C), D, (B, E)) (((A, C)), D, (B, E)) ((A, (C)), D, (B, E)) A, B, C, D, E ((A, B), F, (D, E)) ((A, B), C, (D, E)): ((A, B), C, (D, E)) (((A, B), C, (D, E))): ((A, B), C, (D, E)) ((A, B), D, (C, E)): (C, E) ((A, C), D, (B, E)): (A, C) (((A, C)), D, (B, E)): (A, C) ((A, (C)), D, (B, E)): (C) A, B, C, D, E: ((A, B), F, (D, E)):

    (I cooked up this regex while I was walking home from the train station. It turned out to work at the first try).

    Abigail

      (I cooked up this regex while I was walking home from the train station. It turned out to work at the first try).
      Amazing. The most constructive thing I think about on the way home from the train station is if I will be able to get my "welcome home" beer open on the first try.
Re: Parsing nested parentheses
by hardburn (Abbot) on Nov 18, 2003 at 17:20 UTC

    AFAIK, CPAN modules for matching balanced text are only going to return what is in the parens, not filter out information inside them. You would use the module to grab the text, then parse it yourself. A few modules I can think of are Text::Balanced and Regexp::Common::balanced.

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

Re: Parsing nested parentheses
by diotalevi (Canon) on Nov 18, 2003 at 17:17 UTC

    Simple? Perhaps. For the moment have a gander at Maximal match in a recursive regex which is the bare minimum required if your going to merely capture maximum. My first thought is to follow the $re check with a code block that can check the just-captured result for the contents of your search.

    for ( A .. E ) { print "$_: "; print '((A, B), C, (D, E))' =~ match( $_ ); # Returns '(D, E)' print "\n"; } sub match { my $search = shift; my $re; eval qq[ \$re = qr/ \\( # Opening parenthese ((?: # Capture the contents [^()]+ # Body of a link | (??{\$re}) # Or recurse )+) # and allow repeats internally \\) # Closing parenthese # Test that the just-closed capture block contains $search +. (?(?{ -1 != index \$^N, \$search })(?=)|(?!)) /x; ] or die $@; return $re; }

      That doesn't give any result when 'C' is the argument.

        Oh too bad. It "should". There's also some funny action with stuff being cached. For kicks, try a for ( A .. E ) { ... =~ match( $_ ) } and you'll notice that the expression never shifts off of its initial search.
Re: Parsing nested parentheses
by TheDamian (Vicar) on Nov 18, 2003 at 20:15 UTC
    This is reasonably straightforward with Regexp::Common, though you do have to "roll-your-own" a little:
    use Regexp::Common; my $balanced = qr/[^()]+|$RE{balanced}{-parens=>'()'}/; sub extract { my ($want, $from) = @_; my $nested = qr/$balanced* \Q$want\E $balanced*/x; $from =~ m/( \( $nested \) | \Q$want\E )/x; return $1; } my $expr = '(((B,G),A,(A,B)),C,(D,(E))),F'; for ('A'..'H') { print "$_: ", extract($_,$expr),"\n"; }
    Note that I've assumed that, in the edge case where the target is unnested (e.g. 'F' in the above example), the extraction should just return target itself.

    Also note the standard regexish "left-most-longest-match" behaviour when the target appears more than once (as do targets 'A' and 'B' in the example).

      I tried this out, but it give a segmentation fault on the line: $from =~ m/( \( $nested \) | \Q$want\E )/x; Any ideas?
        Any ideas?
        Not without a little more information. What version of perl are you using? What version of Regexp::Common?
Re: Parsing nested parentheses
by fletcher_the_dog (Friar) on Nov 18, 2003 at 19:46 UTC
    Here is a simple way to do it by keeping track of how deep you are into parenthesis' and what depth your match occured at:
    use strict; my $str = "((A,B),C,(D,E))"; foreach my $m (qw(A B C D E)) { print $m." => ".FindMinParens($m,$str)."\n"; } sub FindMinParens{ my $match = shift; my $string = shift; my @start_pos; my $depth = -1; # keep matching parens or $match while ($string=~/([()]|$match)/g) { if ($1 eq "(") { # record opening paren positions push @start_pos,pos($string); } elsif ($1 eq ")") { # if we reached the closing parens for the minimum pair # get the sub string and exit if ($#start_pos == $depth) { my $start = $start_pos[-1]; my $len = pos($string) - $start -1; return substr($string,$start,$len); }else{ pop @start_pos; } } else { # store depth of $match $depth = $#start_pos; } } return ""; } __OUTPUT__ A => A,B B => A,B C => (A,B),C,(D,E) D => D,E E => D,E
Re: Parsing nested parentheses
by shotgunefx (Parson) on Nov 18, 2003 at 18:29 UTC
    This is probably a klunky solution, but it reminded me of a parser I had written.. (Evaluate Expressions.)

    Quickly hacking at it and only testing a few simple cases (you've been warned)

    You could use the returned structure to search for whatever criteria you desire. I'm sure someone here has a much more elegant solution, but my regex abilities are rather limited...
    #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my $input = '(a,b,(c,d,(e,f,g)))'; print Dumper parse_expression($input); sub parse_expression { my $exp = shift; my @tokens = (); $exp=~s/\s*([()])\s*/ $1 /go; # Get tokens push @tokens, $1 while $exp=~/\G\s*(".*?")/gc or $exp=~/\G\s*('.*? +')/gc or $exp=~/\G\s*(\S+)/gc; # Find any parens. my (@lp,@rp) = (); for (my $p =0; $p < @tokens; $p++){ if ($tokens[$p] eq '('){ push @lp,$p; }elsif($tokens[$p] eq ')'){ push @rp,$p; } } if ( @lp != @rp){ warn "Mismatched parens in expression.\n"; return; } my @temp = @tokens; for (my $i=0; $i < @rp; $i++){ my @wanted; for (my $j = $#lp; $j >= 0 ; $j--){ if ( defined $lp[$j] && $lp[$j] < $rp[$i] ){ (undef,@wanted) = @tokens[ $lp[$j] .. ($rp[$i] - 1 ) +] ; @tokens[ $lp[$j] .. ($rp[$i]) ] = [ grep {defined $_ +} @wanted]; push @temp, map {split /\s*,\s*/} @wanted; $lp[$j] = $rp[$i] = undef; last; } } } return $tokens[0]; } __DATA__ # OUTPUT $VAR1 = [ 'a,b,', [ 'c,d,', [ 'e,f,g' ] ] ];


    -Lee

    "To be civilized is to deny one's nature."
      Hi there, ever thought of using the perl internal functions to import a nested data structure to a LoL? If you do not have to avoid using eval, you could do it real simple like this:
      #!/usr/bin/perl -w use strict; use Data::Dumper; my $string = "(a,b,(c,d,(e,f,g)))"; $string =~ s/([^,\(\)]+)/"$1"/gsm; # quote content $string =~ tr/\(\)/\[\]/; # replace braces my @LoL; # define list eval("\@LoL = $test;"); # fill LoL print Dumper @LoL; __DATA__ Output: $VAR1 = [ 'a', 'b', [ 'c', 'd', [ 'e', 'f', 'g' ] ] ];
      It is not extensively tested, but should work for most items. - frank
Re: Parsing nested parentheses
by Anonymous Monk on Nov 18, 2003 at 17:55 UTC

    One way using regular expressions:

    my $expr = '((A,B),C,(D,E))'; print extract('C',$expr),"\n"; print extract('B',$expr),"\n"; print extract('D',$expr),"\n"; sub extract { my $char = shift; my ($str,$dup) = (shift) x 2; 1 while $dup =~ s/\([^()$char]*\)/'.' x length $&/e; $dup =~ m/(\([^()$char]*$char[^()]*\))/; return substr($str,$-[1],length $1); } __END__ # output: ((A,B),C,(D,E)) (A,B) (D,E)

    But you probably want to build a small expression parser.

      Being an ancient C programmer, I would probably recurse through the string, inspecting a character at a time. A "(" would mean a call to myself with the remainder of the string. Other characters (except for ")") I would load up into a string, which would be all the stuff within a (). No, better, load them into separate strings, starting a new string after each comma. A ")" would mean returning, after inspecting the above string(s) for a match. If you find a match, I guess you print out that list, comma delimited just like it was in the original string? Or you could just print the substring (of the original string) just inspected. And yes, you could use a regex for the comparison, but you could just use "eq" to compare the strings, since the commas and parens would already be gone. To a true Perl programmer, this is probably brute force and ignorance, but I've used this approach before and it can work.
Re: Parsing nested parentheses
by welchavw (Pilgrim) on Nov 19, 2003 at 17:53 UTC

    This will not work if all tokens are not the same length, because of the reliance on index, which seems ok given the sample dataset provided. Also, its probably not the best approach, but I believe it works and I had fun producing it.

    use strict; use warnings; my $val = 'X'; my $min_len_subgroup; sub chk { my $subgroup = shift; if(-1 != index($subgroup, $val)) { if (!defined $min_len_subgroup || length $subgroup < length $min_l +en_subgroup) { $min_len_subgroup = $subgroup; } } } our $re = qr{ ( \( (?: (?> [^()]+ ) | (??{ $re }) )* \) ) (?{ chk($1) }) }x; my @pats = ( "(1(2(X)))", "(1(2(X,a))) (3(X,e,f))", "(1(2(X,b,c))) (3(X,f))", "(1(2(X,a))) (X,b)", "(1(2(X,a))) (X,b) X", "(1(2(X,a))) (X,b) (X)", ); for my $pat (@pats) { undef $min_len_subgroup; if (() = ($pat =~ /$re/g)) { print "$min_len_subgroup\n"; } }
    ,welchavw
Re: Parsing nested parentheses
by Anonymous Monk on Nov 21, 2003 at 07:21 UTC
    my $expr = '((A,B),C,(D,E))'; print extract('C',$expr),"\n"; print extract('B',$expr),"\n"; print extract('D',$expr),"\n"; sub extract { my $char = shift; local $_ = shift; (my $re=$_)=~s/((\()|(\))|.)/$2\Q$1\E$3/gs; return (sort{length $a <=> length $b}(grep/$char/,eval{/$re/},$@ & +& die $@))[0]; }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://308039]
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2024-04-16 21:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found