Elegant examples to parse parenthesised strings

back-n-black has asked for the wisdom of the Perl Monks concerning the following question:

I've never been that good at regular expressions. What I want to do is parse many log entries for words, ultimately, in SQL like expressions.

For example.

$line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'brother' ) or ( is-mother.to code = 'dog' ) )";

What ultimately I need out of these strings are:

my.my id 
is-relative.to code 
is-relative.to code 
is-mother.to code
[download]

but something like this would be great!

( my.my id >= 1 )
( is-relative.to code = 'sister' )
( is-relative.to code = 'brother' )
( is-mother.to code = 'dog' )
[download]

my.my id >= 1 )
is-relative.to code = 'sister'
is-relative.to code = 'brother'
is-mother.to code = 'dog'
[download]

I have been looking a while for hints to an elegant resolution for this problem. There is much dialogue about the use of Text::Balanced but not enough examples in the documentation for my little brain, to help me solve the riddle.

I have an example here that just pulls the expressions, I know what to do from there. I would like some ideas or code examples on a more elegant solution using one of the CPAN modules if that is possible.

What it basically does is:

Split the text at the first close parens
Parse the expression out of this "before" text

Remove everything up to and including the last open paren
Remove any beginning or trailing spaces

Split the "after" text this time, and repeat the above operations

Here is a snippet of code that pulls the expressions

$text = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) an
+d ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'br
+other' ) or ( is-mother.to code = 'dog' ) )";

my $new = $text;
while ( 1 ) {
   $ind = index($new, ')');

   # Split the text at the first close parens
   $before = substr($new,0,$ind);
   $after = substr($new,$ind);
   last if ( $before eq "" );

   # Clean up the before string
   #  Remove everything up to and including the last open paren
   #  Remove any beginning or trailing spaces
   $before = substr($before,rindex($before,'(')+1);
   $before =~ s/^\s+//;
   $before =~ s/\s+$//;
   push(@list,$before);

   if ( $after =~ /\)/ ) {
      # Disgard chars up to the first open paren
      $after = substr($after,index($after,'(')+1);
      $new = $after;
      print "\n";
  } else {
     last;
  }
}

foreach my $i (@list) {
   print "--".$i."--\n";
}
[download]

Comment on Elegant examples to parse parenthesised strings Select or Download Code

Replies are listed 'Best First'.
Re: Elegant examples to parse parenthesised strings by jettero (Monsignor) on May 19, 2010 at 12:11 UTC
I've never been that good at regular expressions. There's a problem here. Traditionally, regular expressions can't match things that count (aⁿbⁿ). Perl RE have things that help with this (things that aren't really DFA-y); but you still probably want to look at things like Text::Balanced. -Paul	[reply]
Re: Elegant examples to parse parenthesised strings by cdarke (Prior) on May 19, 2010 at 13:06 UTC
When parsing strings I usually go for the 'brute-force' approach (I'm just that kinda guy), splitting around each char and testing one-at-a-time: `use feature ":5.10"; use strict; use warnings; my $line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) + and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = +'brother' ) or ( is-mother.to code = 'dog' ) )"; my $open = 0; my @result = (''); for my $char (split ('',$line)) { given ($char) { when ('(') { $open++ } when (')') { $open--; push @result,'' } default { $result[-1] .= $char if $open } } }` [download] That gives me: `my.my id >= 1 is-relative.to code = 'sister' or is-relative.to code = 'brother' or is-mother.to code = 'dog'` [download] Which is not good enough, so I need some tidy-up: `for my $line (@result) { $line =~ s/^\s+(?:or\s+)?(.)[ ><!=]=./$1\n/; } @result = grep !/^\s+$/,@result; print @result,"\n";` [download] Which gives: `my.my id is-relative.to code is-relative.to code is-mother.to code` [download] which I think is what you are looking for. Not sure that it is all that elegant though.	[reply] [d/l] [select]
Re: Elegant examples to parse parenthesised strings by Krambambuli (Curate) on May 19, 2010 at 12:48 UTC
If, as it seems, you need only the innermost parenthesises, something like `my @exps = $text =~ / $ ( [^\(]*? ) $ /xmg;` [download] might be good enough for this particular situation. Krambambuli ---	[reply] [d/l]
Re: Elegant examples to parse parenthesised strings by toolic (Bishop) on May 19, 2010 at 13:02 UTC
There is much dialogue about the use of Text::Balanced but not enough examples in the documentation There are Text::Balanced code examples if you follow the MANIFEST link on CPAN and look at the tests (`t/02_extbrk.t`, etc.). There are more code examples here at the Monastery if you use Super Search where title contains "text::balanced" (returns 12 hits). Another search where any text contains "text::balanced" retruns many more hits.	[reply] [d/l]
Re: Elegant examples to parse parenthesised strings by Marshall (Canon) on May 19, 2010 at 18:04 UTC
I guess another fairly brute force approach.. `#!/usr/bin/perl -w use strict; my $line = "05/04/2010 13:09:45 - A - somebody - ( (( my.my id >= 1 )) + ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code += 'brother' ) or ( is-mother.to code = 'dog' ) )"; my @terms = $line =~ /$.?$/g; foreach (@terms) { s/.($.$)./$1/; s/^$\s//; s/\s$$//; print "$_\n"; } __END__ Prints: my.my id >= 1 is-relative.to code = 'sister' is-relative.to code = 'brother' is-mother.to code = 'dog'` [download]	[reply] [d/l]
Re: Elegant examples to parse parenthesised strings (Parse::Balanced) by repellent (Priest) on May 24, 2010 at 05:29 UTC
A rather obscure module I wrote some time ago experiments with using complex data structures to define how a string is to be parsed in a balanced way. A list of tokens describes the parsing behavior. Each token could either be a string, a regexp, or an arrayref of tokens. The arrayref is used to define a "balance branch" where its first token is searched for in order to enter the branch, and its last token is searched for to leave the branch. Any string/regexp token in between is skipped over as an "escaped" literal. use Data::Dumper; use Parse::Balanced qw(parse_balanced is_balanced); my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) +and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = ' +brother' ) or ( is-mother.to code = 'dog' ) )"; # recursively parse opening and closing parentheses my @tokens; @tokens = ( "(", \@tokens, ")" ); # parse with a single branch my @p = parse_balanced($str, \@tokens); print Dumper \@p; print "String is ", (is_balanced(@p) ? "" : "not "), "balanced.\n"; # break circular reference @tokens = (); __END__ $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' ', [ '(', ' my.my id >= 1 ', ')' ], ' ', ')' ], ' and ', [ '(', ' ', [ '(', ' is-relative.to code = \'sister\' ', ')' ], ' or ', [ '(', ' is-relative.to code = \'brother\' ', ')' ], ' or ', [ '(', ' is-mother.to code = \'dog\' ', ')' ], ' ', ')' ] ]; String is balanced. [download] And here's the module: Read more... (4 kB)	[reply] [d/l] [select]
Re^2: Elegant examples to parse parenthesised strings (Parse::Balanced) by back-n-black (Initiate) on May 26, 2010 at 14:36 UTC
Thanks to all for your responses As it turns out I had forgotten a case which all methods are having problems solving. For example: `my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) an +d ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( is- +father.to code intersects bfunc(val1,val2,val3,val4) )";` [download] Krambambuli - Nice little one liner cdarke - I do not have access to feature in this environment and in a very quick look at cpan I could not find what module had feature in it. Sorry, I did not pursue this example. Marshell - I liked this, good small brute force example repellent - This is really what I was looking for but like the others it is having problems with ( is-mother.to code intersects afunc(val1,val2,val3,val4). `my $str = "05/04/2010 13:09:45 - A - somebody - ( is-mother.to code in +tersects afunc(val1,val2,val3,val4) )"; $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' is-mother.to code intersects afunc', [ '(', 'val1,val2,val3,val4', ')' ], ' ', ')' ] ]; String is balanced.` [download] Since it seems that all are having issues with this one scenario I am thinking I will have to combine Parse::Balanced with some brute force operations. One idea I have is to just Grab the portions of the line containing this unique pattern(s). Remove it from the line and deal with it seperately. Finish normally with whats left of the resulting line. Unless someone can come up with a way to handle the line in it's entirety. `Result should be my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) and ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( i +s-father.to code intersects bfunc(val1,val2,val3,val4) )"; my.my id >= 1 is-mother.to code intersects afunc(val1,val2,val3,val4) is-father.to code intersects bfunc(val1,val2,val3,val4)` [download]	[reply] [d/l] [select]
Re^3: Elegant examples to parse parenthesised strings (Parse::Balanced) by repellent (Priest) on May 26, 2010 at 22:44 UTC
Add an "escape" token for those function calls: `@tokens = ( "(", qr/(?:[ab]func$[^)]*$)/, \@tokens, ")" );` [download] Parsing contents in-between the function call parentheses is left as an exercise. By the way, your second `$str` is unbalanced.	[reply] [d/l] [select]