lostriver has asked for the wisdom of the Perl Monks concerning the following question:

O wise ones, I need to parse many of these:
("TELNET:OVERFLOW:OPTIONS-REPLY"
  :info (
    :english (
      :name ("TELNET:OVERFLOW:OPTIONS-REPLY")
      :long_name ("TELNET Options Overflow (Response)")
      :description ("This signature detects attempts to exploit a known vulnerability against the BSD-based TELNET daemon. The option processing function (telrcv) in the daemon produces responses with a fixed size buffer, but does not perform bounds checking. Attackers can send a combination of TELNET protocol options to the daemon to overflow the buffer and execute arbitrary commands.")
    )
  )
  :color (red)
  :severity (5)
  :category (TELNET)
  :keywords ("telnet dangerous command bin")
)
Regexes will not cut it... Will Parse::RecDescent help? Any other way? Thanks.

Replies are listed 'Best First'.
Re: Tricky parsing.
by ikegami (Patriarch) on Mar 17, 2006 at 00:15 UTC

    Quite easily

    my $grammar = <<'__END_OF_GRAMMAR__';

       {
          use strict;
          use warnings;

          sub dequote {
             local $_ = @_ ? $_[0] : $_;
             s/^"//;
             s/"$//;
             s/\\(.)/$1/sg;
             return $_;
          }
       }

       parse      : '(' item ')' /\z/ { $item[2] }

       item       : value child(s?) { [ $item[1], $item[2] ] }
                  | child(s?) { [ undef, $item[1] ] }

       child      : CHILD_NAME '(' item ')' { [ @item[1,3] ] }

       value      : NUMBER
                  | IDENT
                  | QSTRING

       NUMBER     : /[1-9][0-9]*/
       IDENT      : /[a-zA-Z][a-zA-Z0-9_]*/
       QSTRING    : /"(?:[^"\\]|\\.)*"/s { dequote($item[1]) }
       CHILD_NAME : /:[a-zA-Z][a-zA-Z0-9_]*/ { substr($item[1], 1) }

    __END_OF_GRAMMAR__

    Untested.

    Update: I just noticed the "value" is optional. Fixed.

    Update: Tested. Found a few bugs. Fixed them.

    use Data::Dumper;
    use Parse::RecDescent ();

    my $parser = Parse::RecDescent->new($grammar)
       or die("Bad grammar\n");

    my $parse_tree = $parser->parse($text)
       or die("Bad text\n");

    print Dumper $parse_tree;
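Since the goal is to pull out individual fields, here is a minimal sketch of walking the resulting tree. It assumes the shape implied by the grammar's actions: an item is `[ $value_or_undef, [ child, ... ] ]` and a child is `[ $name, item ]`. The sample tree is hand-built and abbreviated rather than actual parser output, and `find_field` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-built, abbreviated sample in the assumed shape:
#   item  = [ $value_or_undef, [ child, ... ] ]
#   child = [ $name, item ]
my $parse_tree = [
    'TELNET:OVERFLOW:OPTIONS-REPLY',
    [
        [ info => [ undef, [
            [ english => [ undef, [
                [ name      => [ 'TELNET:OVERFLOW:OPTIONS-REPLY', [] ] ],
                [ long_name => [ 'TELNET Options Overflow (Response)', [] ] ],
            ] ] ],
        ] ] ],
        [ color    => [ 'red', [] ] ],
        [ severity => [ 5,     [] ] ],
    ],
];

# Depth-first search for the first child named $wanted.
# Returns that child's value, or nothing if no such child exists.
sub find_field {
    my ($item, $wanted) = @_;
    my $children = $item->[1];
    for my $child (@$children) {
        my ($name, $sub_item) = @$child;
        return $sub_item->[0] if $name eq $wanted;
        my $found = find_field($sub_item, $wanted);
        return $found if defined $found;
    }
    return;
}

print scalar(find_field($parse_tree, 'long_name')), "\n";
print scalar(find_field($parse_tree, 'severity')), "\n";
```

A tag that only holds children (such as `info` here) has an undefined value, so looking it up this way yields undef; returning the sub-item instead of its value would be the change to make if whole subtrees are wanted.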
Re: Tricky parsing.
by saintmike (Vicar) on Mar 17, 2006 at 00:18 UTC
    Text::Balanced will take care of the nested parentheses.

    Update:

    Here's a script to parse your data into a hash of hashes.

    To get access to long_name, for example, just use $href->{info}->{english}->{long_name}:

    use Text::Balanced qw(extract_bracketed);

    my $rex = qr/(?s):(\w+) (\(.*)/;

    my $href = {};
    my $data = join '', <DATA>;
    extract($data, $href);
    print $href->{info}->{english}->{long_name}, "\n";

    ###########################################
    sub extract {
    ###########################################
        my($input, $href) = @_;

        while($input =~ /$rex/) {
            my($tok, $str) = ($1, $2);
            my ($extr, $rest) = extract_bracketed($str, '()');
            if($extr =~ /$rex/) {
                $href->{$tok} = {};
                extract($extr, $href->{$tok});
            } else {
                $extr =~ s/^\(|\)$//g;
                $href->{$tok} = $extr;
            }
            $input = $rest;
        };
        return $href;
    }

    __DATA__
    ("TELNET:OVERFLOW:OPTIONS-REPLY"
      :info (
        :english (
          :name ("TELNET:OVERFLOW:OPTIONS-REPLY")
          :long_name ("TELNET Options Overflow (Response)")
          :description ("This signature detects attempts to exploit a known vulnerability against the BSD-based TELNET daemon. The option processing function (telrcv) in the daemon produces responses with a fixed size buffer, but does not perform bounds checking. Attackers can send a combination of TELNET protocol options to the daemon to overflow the buffer and execute arbitrary commands.")
        )
      )
      :color (red)
      :severity (5)
      :category (TELNET)
      :keywords ("telnet dangerous command bin")
    )
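To make the loop in the script easier to follow, this small sketch shows what a single `extract_bracketed` call returns in list context: the balanced group first, then whatever text is left over. The input string is a made-up fragment, not taken from the real data.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Made-up fragment with nested parentheses (illustrative only).
my $text = '(:name ("a (b) c")) :color (red)';

# In list context the first two return values are the balanced group
# and the remainder of the string after it.
my ($extracted, $remainder) = extract_bracketed($text, '()');

print "extracted: $extracted\n";
print "remainder: $remainder\n";
```

With `'()'` as the delimiter specification only parentheses are counted, so a quoted string containing an unbalanced parenthesis would presumably upset the matching; the sample record happens to have balanced parentheses inside its strings.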
Re: Tricky parsing.
by zer (Deacon) on Mar 17, 2006 at 00:48 UTC
    Parse::RecDescent is also a good option. The book Advanced Perl goes into depth on this type of parsing.
Re: Tricky parsing.
by zer (Deacon) on Mar 16, 2006 at 23:58 UTC
    How does this need to be parsed?
      I need to extract certain fields (this is just a snippet; the real data has many more tags and deeper nesting). Say, 'description' => value...
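For the "'description' => value" form, one possible sketch is to flatten a nested hash of hashes (the shape saintmike's script builds) into a single tag => value hash. The sample record is hand-written and abbreviated, `collect_leaves` is a hypothetical helper name, and the code assumes leaf values may still carry their surrounding double quotes.

```perl
use strict;
use warnings;

# Hand-written, abbreviated record in hash-of-hashes form (illustrative only).
my $record = {
    info => {
        english => {
            name      => '"TELNET:OVERFLOW:OPTIONS-REPLY"',
            long_name => '"TELNET Options Overflow (Response)"',
        },
    },
    color    => 'red',
    severity => '5',
};

# Walk the nested hashes and copy every leaf into one flat hash keyed by
# tag name, dropping a single pair of surrounding double quotes if present.
sub collect_leaves {
    my ($node, $flat) = @_;
    $flat ||= {};
    for my $key (sort keys %$node) {
        my $value = $node->{$key};
        if (ref $value eq 'HASH') {
            collect_leaves($value, $flat);
        } else {
            $value =~ s/^"(.*)"$/$1/s;
            $flat->{$key} = $value;
        }
    }
    return $flat;
}

my $fields = collect_leaves($record);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

If the same tag name can occur at several depths, later occurrences overwrite earlier ones in this sketch; keying by the full path (for example `info/english/name`) would avoid that.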