back-n-black has asked for the wisdom of the Perl Monks concerning the following question:

I've never been that good at regular expressions. What I want to do is parse many log entries for words, ultimately, in SQL like expressions.

For example.

$line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'brother' ) or ( is-mother.to code = 'dog' ) )";

What ultimately I need out of these strings are:

my.my id is-relative.to code is-relative.to code is-mother.to code

but something like this would be great!

( my.my id >= 1 ) ( is-relative.to code = 'sister' ) ( is-relative.to code = 'brother' ) ( is-mother.to code = 'dog' )

or

my.my id >= 1 ) is-relative.to code = 'sister' is-relative.to code = 'brother' is-mother.to code = 'dog'

I have been looking a while for hints to an elegant resolution for this problem. There is much dialogue about the use of Text::Balanced but not enough examples in the documentation for my little brain, to help me solve the riddle.

I have an example here that just pulls the expressions, I know what to do from there. I would like some ideas or code examples on a more elegant solution using one of the CPAN modules if that is possible.

What it basically does is:

  1. Split the text at the first close parens
  2. Parse the expression out of this "before" text
  3. Split the "after" text this time, and repeat the above operations

Here is a snippet of code that pulls the expressions

$text = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) an +d ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = 'br +other' ) or ( is-mother.to code = 'dog' ) )"; my $new = $text; while ( 1 ) { $ind = index($new, ')'); # Split the text at the first close parens $before = substr($new,0,$ind); $after = substr($new,$ind); last if ( $before eq "" ); # Clean up the before string # Remove everything up to and including the last open paren # Remove any beginning or trailing spaces $before = substr($before,rindex($before,'(')+1); $before =~ s/^\s+//; $before =~ s/\s+$//; push(@list,$before); if ( $after =~ /\)/ ) { # Disgard chars up to the first open paren $after = substr($after,index($after,'(')+1); $new = $after; print "\n"; } else { last; } } foreach my $i (@list) { print "--".$i."--\n"; }

Replies are listed 'Best First'.
Re: Elegant examples to parse parenthesised strings
by jettero (Monsignor) on May 19, 2010 at 12:11 UTC
    I've never been that good at regular expressions.

    There's a problem here. Traditionally, regular expressions can't match things that count (anbn). Perl RE have things that help with this (things that aren't really DFA-y); but you still probably want to look at things like Text::Balanced.

    -Paul

Re: Elegant examples to parse parenthesised strings
by cdarke (Prior) on May 19, 2010 at 13:06 UTC
    When parsing strings I usually go for the 'brute-force' approach (I'm just that kinda guy), splitting around each char and testing one-at-a-time:
    use feature ":5.10"; use strict; use warnings; my $line = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) + and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = +'brother' ) or ( is-mother.to code = 'dog' ) )"; my $open = 0; my @result = (''); for my $char (split ('',$line)) { given ($char) { when ('(') { $open++ } when (')') { $open--; push @result,'' } default { $result[-1] .= $char if $open } } }
    That gives me:
    my.my id >= 1 is-relative.to code = 'sister' or is-relative.to code = 'brother' or is-mother.to code = 'dog'
    Which is not good enough, so I need some tidy-up:
    for my $line (@result) { $line =~ s/^\s+(?:or\s+)?(.*)[ ><!=]=.*/$1\n/; } @result = grep !/^\s+$/,@result; print @result,"\n";
    Which gives:
    my.my id is-relative.to code is-relative.to code is-mother.to code
    which I think is what you are looking for. Not sure that it is all that elegant though.
Re: Elegant examples to parse parenthesised strings
by Krambambuli (Curate) on May 19, 2010 at 12:48 UTC
    If, as it seems, you need only the innermost parenthesises, something like
    my @exps = $text =~ / \( ( [^\(]*? ) \) /xmg;
    might be good enough for this particular situation.


    Krambambuli
    ---
Re: Elegant examples to parse parenthesised strings
by toolic (Bishop) on May 19, 2010 at 13:02 UTC
Re: Elegant examples to parse parenthesised strings
by Marshall (Canon) on May 19, 2010 at 18:04 UTC
    I guess another fairly brute force approach..

    #!/usr/bin/perl -w use strict; my $line = "05/04/2010 13:09:45 - A - somebody - ( (( my.my id >= 1 )) + ) and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code += 'brother' ) or ( is-mother.to code = 'dog' ) )"; my @terms = $line =~ /\(.*?\)/g; foreach (@terms) { s/.*(\(.*\)).*/$1/; s/^\(\s*//; s/\s*\)$//; print "$_\n"; } __END__ Prints: my.my id >= 1 is-relative.to code = 'sister' is-relative.to code = 'brother' is-mother.to code = 'dog'
Re: Elegant examples to parse parenthesised strings (Parse::Balanced)
by repellent (Priest) on May 24, 2010 at 05:29 UTC
    A rather obscure module I wrote some time ago experiments with using complex data structures to define how a string is to be parsed in a balanced way.

    A list of tokens describes the parsing behavior. Each token could either be a string, a regexp, or an arrayref of tokens. The arrayref is used to define a "balance branch" where its first token is searched for in order to enter the branch, and its last token is searched for to leave the branch. Any string/regexp token in between is skipped over as an "escaped" literal.
    use Data::Dumper; use Parse::Balanced qw(parse_balanced is_balanced); my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) ) +and ( ( is-relative.to code = 'sister' ) or ( is-relative.to code = ' +brother' ) or ( is-mother.to code = 'dog' ) )"; # recursively parse opening and closing parentheses my @tokens; @tokens = ( "(", \@tokens, ")" ); # parse with a single branch my @p = parse_balanced($str, \@tokens); print Dumper \@p; print "String is ", (is_balanced(@p) ? "" : "not "), "balanced.\n"; # break circular reference @tokens = (); __END__ $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' ', [ '(', ' my.my id >= 1 ', ')' ], ' ', ')' ], ' and ', [ '(', ' ', [ '(', ' is-relative.to code = \'sister\' ', ')' ], ' or ', [ '(', ' is-relative.to code = \'brother\' ', ')' ], ' or ', [ '(', ' is-mother.to code = \'dog\' ', ')' ], ' ', ')' ] ]; String is balanced.

    And here's the module:

      Thanks to all for your responses

      As it turns out I had forgotten a case which all methods are having problems solving.

      For example:

      my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) an +d ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( is- +father.to code intersects bfunc(val1,val2,val3,val4) )";

      Krambambuli - Nice little one liner

      cdarke - I do not have access to feature in this environment and in a very quick look at cpan I could not find what module had feature in it. Sorry, I did not pursue this example.

      Marshell - I liked this, good small brute force example

      repellent - This is really what I was looking for but like the others it is having problems with ( is-mother.to code intersects afunc(val1,val2,val3,val4).

      my $str = "05/04/2010 13:09:45 - A - somebody - ( is-mother.to code in +tersects afunc(val1,val2,val3,val4) )"; $VAR1 = [ '05/04/2010 13:09:45 - A - somebody - ', [ '(', ' is-mother.to code intersects afunc', [ '(', 'val1,val2,val3,val4', ')' ], ' ', ')' ] ]; String is balanced.

      Since it seems that all are having issues with this one scenario I am thinking I will have to combine Parse::Balanced with some brute force operations.

      One idea I have is to just
      • Grab the portions of the line containing this unique pattern(s).
      • Remove it from the line and deal with it seperately.
      • Finish normally with whats left of the resulting line.

      Unless someone can come up with a way to handle the line in it's entirety.

      Result should be my $str = "05/04/2010 13:09:45 - A - somebody - ( ( my.my id >= 1 ) and ( is-mother.to code intersects afunc(val1,val2,val3,val4) and ( i +s-father.to code intersects bfunc(val1,val2,val3,val4) )"; my.my id >= 1 is-mother.to code intersects afunc(val1,val2,val3,val4) is-father.to code intersects bfunc(val1,val2,val3,val4)
        Add an "escape" token for those function calls:
        @tokens = ( "(", qr/(?:[ab]func\([^)]*\))/, \@tokens, ")" );

        Parsing contents in-between the function call parentheses is left as an exercise.

        By the way, your second $str is unbalanced.