Parsing with regular expressions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Brothers,

I am trying to parse a file using the regex below. When a match fails, I would like to determine how much of the string was matched before the regex engine gave up, so that I can give a useful error message. But since 'pos' only gets set when a match completes, the code below will not do what I had hoped it would. Do I have to break my regex into multiple pieces and use \G, or is there a better way?

Thanks,

Jim


    my $param_rx      = '[^),]+';  
    my $list_start_rx = '\s*\(\s*'; 
    my $list_end_rx   = '\s*\)\s*'; 

    $_ = $stmt;

    /^\s*
     (\S+)
     \s+
     (\S+)
     $list_start_rx
     ($param_rx(?:\s*,\s*$param_rx)*)?
     $list_end_rx
     =\s*
     (?:0x)?\d+
     \s*;\s*$
    /cgxo;

    if (pos() != length($stmt)) {
      print "\n$stmt\n";
      print ' ' x pos();
      print "^<-- Parse failed here (column " . pos()  . " of " . length($stmt) . ")\n";

      &error ($ARGV, $line_num, 'Exiting');
    }

Re: Parsing with regular expressions by Abigail-II (Bishop) on Jul 29, 2003 at 13:53 UTC
That's not at all easily doable, and in a lot of cases the answer wouldn't be useful at all. It could well be that you have a pattern that matches a string almost completely, failing only on a 'last' character. But the optimizer could already have noticed that this character isn't in the string, so Perl doesn't even attempt the match. So there wouldn't be any logical value for `pos` to return.

Furthermore, if you do `"doghouse" =~ /doghair|cathouse/;`, how far did that match? 3 characters? Or 0? And what about `"ababa" =~ /^(?>.*)b/`? How far did that "match"?

Abigail
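A minimal sketch of the two examples above, for anyone who wants to try them. The helper name `try_match` is made up for illustration. It only shows that both matches fail and that `pos` carries no information afterwards; it does not show what the engine did internally.

```perl
use strict;
use warnings;

# Hypothetical helper: try a global match in scalar context and report
# whether it succeeded, together with pos() afterwards.
sub try_match {
    my ($str, $re) = @_;
    my $ok = ($str =~ /$re/g) ? 1 : 0;
    return ($ok, pos($str));   # pos() is undef after a failed match without /c
}

for my $case (
    [ 'doghouse', qr/doghair|cathouse/ ],   # fails: neither alternative fits
    [ 'ababa',    qr/^(?>.*)b/         ],   # fails: the atomic group keeps everything
    [ 'doghouse', qr/dog/              ],   # succeeds, for comparison
) {
    my ($ok, $pos) = try_match(@$case);
    printf "%-10s %-26s matched=%d pos=%s\n",
        $case->[0], $case->[1], $ok, defined $pos ? $pos : 'undef';
}
```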
Re: Parsing with regular expressions by halley (Prior) on Jul 29, 2003 at 13:53 UTC
Yes, regex matching is really good at discovering either complete success or no possible success. It's not as good at "chewing up" an input as far as possible then stopping so you can see what to chew up next. It can do it, but it's not obvious what kind of looping structure you'd need for most simple projects.

The book, Mastering Regular Expressions, has quite a few real-world and contrived examples of how to do this token-by-token parsing, chewing up the input and accepting different constructs according to state.

The module, Parse::RecDescent, has a lot of power in developing parsing logic from a grammar of possible valid inputs. If you've used YACC, you'll find this familiar. If you've not explored such structured grammars, it can be daunting without examples.

-- `[ e d @ h a l l e y . c c ]`
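One common shape for the "chewing up" loop described above uses `\G` with the `/gc` modifiers, so that a failed match leaves `pos` where the last token ended. The sketch below is only an illustration: the sub name `tokenize` and the three token classes (NUM, WORD, PUNCT) are assumptions chosen to fit statements like the one in the question, not something taken from the book or from Parse::RecDescent.

```perl
use strict;
use warnings;

# Hypothetical tokenizer for statements such as "type name (a, b) = 0x10;".
# Returns (1, \@tokens) on success, or (0, $offset) where $offset is the
# position at which no token rule matched.
sub tokenize {
    my ($stmt) = @_;
    my @tokens;
    pos($stmt) = 0;
    while (1) {
        $stmt =~ /\G\s+/gc;                  # skip whitespace; /c keeps pos on failure
        last if pos($stmt) >= length $stmt;  # consumed the whole input
        if    ($stmt =~ /\G(0x[0-9a-fA-F]+|\d+)/gc) { push @tokens, [ NUM   => $1 ] }
        elsif ($stmt =~ /\G([A-Za-z_]\w*)/gc)       { push @tokens, [ WORD  => $1 ] }
        elsif ($stmt =~ /\G([(),=;])/gc)            { push @tokens, [ PUNCT => $1 ] }
        else  { return (0, pos($stmt)) }     # nothing fits here: report the offset
    }
    return (1, \@tokens);
}

# Usage: point at the first character that could not be tokenized.
my $input = 'int foo (a # b) = 5;';
my ($ok, $result) = tokenize($input);
if (!$ok) {
    print "$input\n", ' ' x $result, "^<-- no token matches here (column $result)\n";
}
```

A tokenizer like this reports lexical errors only. Checking that the tokens arrive in a valid order would be a second pass over `@tokens`, or the job of a grammar-based parser.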
Re: Parsing with regular expressions by fletcher_the_dog (Friar) on Jul 29, 2003 at 16:18 UTC
Here is a way to do it that I have used before for finding tokens in C files:

    use strict;
    my $param_rx      = '[^),]+';
    my $list_start_rx = '\s*\(\s*';
    my $list_end_rx   = '\s*\)\s*';
    my @regexes=(
      qr/^\s*/,
      qr/(\S+)/,
      qr/\s+/,
      qr/(\S+)/,
      qr/$list_start_rx/,
      qr/($param_rx(?:\s*,\s*$param_rx)*)?/,
      qr/$list_end_rx/,
      qr/=\s*/,
      qr/(?:0x)?\d+/,
      qr/\s*;\s*$/
    );
    PARSER:
    while (<DATA>) {
      foreach my $regex (@regexes) {
        if ( not /\G$regex/gc ) {
          print "^<-- Parse failed on line $. at (column " . pos() . " of " . length($_) . ")\n";
          print '"'.substr($_,0,pos())." HERE>>".substr($_,pos(),-1)."\"\n";
          last PARSER;
        }
      }
    }
    __DATA__
    some list (one, two three) = 5;
    another list (one, two three) = 5;
    a bad list (one two (three))

This outputs:

    ^<-- Parse failed on line 3 at (column 5 of 29)
    "a bad HERE>> list (one two (three))"
Re: Parsing with regular expressions by chunlou (Curate) on Jul 29, 2003 at 14:11 UTC
Not sure it will always work the way you want, but you could try to use (?{CODE}) in your regex.

    $_ = "glad guy sad gal\n" ;

    while( /
        (?{print "$`\n"})
        (\b.a.\b|..l)
        (?{print "\nmatch: $&\n\n"})
    /gx ){};

    __END__
    g
    gl
    gla
    glad
    glad
    glad g
    glad gu
    glad guy
    glad guy

    match: sad

    glad guy sad
    glad guy sad

    match: gal
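A variation on the (?{CODE}) idea, as a rough sketch: instead of printing, each code block records the largest `pos()` seen, giving a "furthest point reached" to report after a failure. The names `furthest_match` and `$furthest` are made up for this example, and the pattern is a simplified copy of the one in the question. The caveats raised earlier in the thread still apply: if the optimizer rejects the string before the engine runs, no code block executes and the value stays 0, and with backtracking the furthest checkpoint is a hint rather than a precise error location.

```perl
use strict;
use warnings;

our $furthest;   # package variable, so the (?{ ... }) blocks can see it

# Hypothetical helper: returns (matched?, furthest checkpoint reached).
sub furthest_match {
    my ($stmt) = @_;
    local $furthest = 0;
    my $ok = $stmt =~ /
        ^ \s* \S+ \s+ \S+     (?{ $furthest = pos() if pos() > $furthest })
        \s* \( \s*            (?{ $furthest = pos() if pos() > $furthest })
        (?: [^),]+ (?: \s* , \s* [^),]+ )* )?
        \s* \) \s*            (?{ $furthest = pos() if pos() > $furthest })
        = \s*                 (?{ $furthest = pos() if pos() > $furthest })
        (?: 0x )? \d+         (?{ $furthest = pos() if pos() > $furthest })
        \s* ; \s* $
    /x;
    return ($ok ? 1 : 0, $furthest);
}

# Usage: mark the furthest checkpoint under the offending statement.
my $stmt = 'int foo (a, b) = xyz;';
my ($ok, $far) = furthest_match($stmt);
unless ($ok) {
    print "$stmt\n", ' ' x $far, "^<-- furthest checkpoint reached (column $far)\n";
}
```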