Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Brothers,

I am trying to parse a file using the regex below. When a match fails, I would like to determine how much of the string was matched before the regex engine gave up, so that I can give a useful error message. But since 'pos' only gets set when there is a completed match, the code below will not do what I had hoped it would. Do I have to break my regex into multiple pieces and use \G, or is there a better way?

Thanks,

Jim

my $param_rx      = '[^),]+';
my $list_start_rx = '\s*\(\s*';
my $list_end_rx   = '\s*\)\s*';

$_ = $stmt;
/^\s* (\S+) \s+ (\S+) $list_start_rx
    ($param_rx(?:\s*,\s*$param_rx)*)?
    $list_end_rx =\s* (?:0x)?\d+ \s*;\s*$
/cgxo;

if (pos() != length($stmt)) {
    print "\n$stmt\n";
    print ' ' x pos();
    print "^<-- Parse failed here (column " . pos() . " of " . length($stmt) . ")\n";
    &error ($ARGV, $line_num, 'Exiting');
}
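
The behaviour described in the question can be reproduced with a minimal sketch. The statement text and the simplified pattern are assumptions made up for illustration, not the poster's real input. After a failed match pos() is still undefined, even with /c, because /c only preserves a position that an earlier successful /g match has set.

```perl
use strict;
use warnings;

# Hypothetical input: a statement that is missing its trailing ';'.
my $stmt = 'foo bar (a, b) = 0x10';

# Simplified stand-in for the pattern in the question.
my $matched = $stmt =~ /^\s*(\S+)\s+(\S+)\s*\([^)]*\)\s*=\s*(?:0x)?\d+\s*;\s*$/gc ? 1 : 0;

# No /g match has ever succeeded on $stmt, so there is no position to report.
my $pos_after = pos($stmt);

print "matched: $matched\n";                                          # matched: 0
print "pos is ", ( defined $pos_after ? $pos_after : 'undef' ), "\n"; # pos is undef
```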

Re: Parsing with regular expressions
by Abigail-II (Bishop) on Jul 29, 2003 at 13:53 UTC
    That's not at all easily doable, and in a lot of cases the answer wouldn't be useful at all. It could well be that your pattern almost matches a string, all but one 'last' character. But the optimizer could already have noticed that this last character isn't in the string, so Perl doesn't even attempt the match. There wouldn't be any logical value for pos to return.

    Furthermore, if you do:

    "doghouse" =~ /doghair|cathouse/;

    How far did that match? 3 characters? Or 0? And what about:

    "ababa" =~ /^(?>.*)b/

    How far did that "match"?

    Abigail
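
The atomic-group example can be checked with a short sketch (the variable names are only illustrative). It shows that the pattern fails as a whole even though `.*` consumed the entire string first, so "how far it got" and "where it failed" are two different questions.

```perl
use strict;
use warnings;

my $text = 'ababa';

# Atomic group: .* takes all five characters and may not give any back,
# so the trailing 'b' can never match.
my $atomic = $text =~ /^(?>.*)b/ ? 1 : 0;

# Ordinary greedy .* backtracks until a 'b' follows, matching "abab".
my $greedy = $text =~ /^.*b/ ? 1 : 0;

print "atomic: $atomic, greedy: $greedy\n";   # prints "atomic: 0, greedy: 1"
```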

Re: Parsing with regular expressions
by halley (Prior) on Jul 29, 2003 at 13:53 UTC
    Yes, regex matching is really good at discovering either complete success or no possible success. It's not as good at "chewing up" an input as far as possible then stopping so you can see what to chew up next. It can do it, but it's not obvious what kind of looping structure you'd need for most simple projects.

    The book, Mastering Regular Expressions, has quite a few real-world and contrived examples of how to do this token-by-token parsing, chewing up the input and accepting different constructs according to state.

    The module, Parse::RecDescent, has a lot of power in developing parsing logic from a grammar of possible valid inputs. If you've used YACC, you'll find this familiar. If you've not explored such structured grammars, it can be daunting without examples.

    --
    [ e d @ h a l l e y . c c ]
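
One possible shape for such a token-by-token loop is sketched below. The token classes and the `tokenize` function are hypothetical, chosen to resemble the statements in the question; this is not code from the book or from Parse::RecDescent. Each alternative is anchored with \G, and /gc keeps the position when an alternative fails, so the offset of the first unrecognised character is always available.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns (\@tokens, $error_pos), where $error_pos
# is undef on success or the offset of the first unrecognised character.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if    ( $text =~ /\G\s+/gc )                  { next }
        elsif ( $text =~ /\G(0x[0-9a-fA-F]+|\d+)/gc ) { push @tokens, [ NUM   => $1 ] }
        elsif ( $text =~ /\G(\w+)/gc )                { push @tokens, [ WORD  => $1 ] }
        elsif ( $text =~ /\G([(),=;])/gc )            { push @tokens, [ PUNCT => $1 ] }
        else                                          { return ( \@tokens, pos($text) ) }
    }
    return ( \@tokens, undef );
}

my ( $tokens, $err ) = tokenize('a b (c @ d) = 5;');
print scalar(@$tokens), " tokens read";
print defined $err ? ", stuck at offset $err\n" : ", no errors\n";
```

A grammar check (are the tokens in a legal order?) would then run over the token list, which keeps the error position separate from the question of what was expected there.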

Re: Parsing with regular expressions
by fletcher_the_dog (Friar) on Jul 29, 2003 at 16:18 UTC
    Here is a way to do it that I have used before for finding tokens in C files:
    use strict;

    my $param_rx      = '[^),]+';
    my $list_start_rx = '\s*\(\s*';
    my $list_end_rx   = '\s*\)\s*';

    my @regexes = (
        qr/^\s*/,
        qr/(\S+)/,
        qr/\s+/,
        qr/(\S+)/,
        qr/$list_start_rx/,
        qr/($param_rx(?:\s*,\s*$param_rx)*)?/,
        qr/$list_end_rx/,
        qr/=\s*/,
        qr/(?:0x)?\d+/,
        qr/\s*;\s*$/
    );

    PARSER:
    while (<DATA>) {
        foreach my $regex (@regexes) {
            if ( not /\G$regex/gc ) {
                print "^<-- Parse failed on line $. at (column " . pos() . " of " . length($_) . ")\n";
                print '"' . substr($_, 0, pos()) . " HERE>>" . substr($_, pos(), -1) . "\"\n";
                last PARSER;
            }
        }
    }
    __DATA__
    some list (one, two three) = 5;
    another list (one, two three) = 5;
    a bad list (one two (three))
    This outputs:
    ^<-- Parse failed on line 3 at (column 5 of 29)
    "a bad HERE>> list (one two (three))"
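
The same idea can be wrapped up to produce the caret-style message from the original question. This is only a sketch: `first_failure_column` and `caret_report` are hypothetical helper names, and the list of pieces is copied from the approach above. One caveat is that Perl will not match a second zero-length piece at the same position under /g, so two adjacent pieces that can both match nothing may report a spurious failure.

```perl
use strict;
use warnings;

# Hypothetical helper: try each piece in turn, anchored with \G where the
# previous one stopped. Returns the column of the first failure, or undef
# if every piece matched.
sub first_failure_column {
    my ( $stmt, @pieces ) = @_;
    pos($stmt) = 0;
    for my $piece (@pieces) {
        return pos($stmt) unless $stmt =~ /\G$piece/gc;
    }
    return undef;
}

# Build the caret-style message the original question was aiming for.
sub caret_report {
    my ( $stmt, $col ) = @_;
    return "$stmt\n" . ( ' ' x $col )
         . "^<-- Parse failed here (column $col of " . length($stmt) . ")\n";
}

my $param  = qr/[^),]+/;
my @pieces = (
    qr/^\s*/, qr/(\S+)/, qr/\s+/, qr/(\S+)/, qr/\s*\(\s*/,
    qr/($param(?:\s*,\s*$param)*)?/, qr/\s*\)\s*/, qr/=\s*/,
    qr/(?:0x)?\d+/, qr/\s*;\s*$/,
);

my $stmt = 'a bad list (one two (three))';
my $col  = first_failure_column( $stmt, @pieces );
print caret_report( $stmt, $col ) if defined $col;
```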
Re: Parsing with regular expressions
by chunlou (Curate) on Jul 29, 2003 at 14:11 UTC
    I'm not sure it will always work the way you want, but you could try using (?{CODE}) in your regex.
    $_ = "glad guy sad gal\n" ;
    while( /
        (?{print "$`\n"})
        (\b.a.\b|..l)
        (?{print "\nmatch: $&\n\n"})
    /gx ){};
    __END__
    g
    gl
    gla
    glad
    glad
    glad g
    glad gu
    glad guy
    glad guy

    match: sad

    glad guy sad
    glad guy sad

    match: gal
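
Applied to the original question, the (?{CODE}) idea might look like the sketch below. The input string, the `$furthest` variable and the `$mark` fragment are assumptions for illustration, and the pattern is a simplified version of the one in the question. The code block records a high-water mark of the offsets the engine reaches. Because of backtracking and the optimizer, that mark is only a hint about where parsing went wrong, and it stays at 0 if Perl decides not to attempt the match at all.

```perl
use strict;
use warnings;

# Package variable, so the code block below can update it.
our $furthest = 0;

# Zero-width marker: records the highest offset the engine has reached.
my $mark = qr/(?{ $furthest = pos() if pos() > $furthest })/;

# Hypothetical input with a stray word before the parameter list.
my $stmt = 'a bad list (one two) = 5;';

my $ok = $stmt =~ /^\s* (\S+) $mark \s+ $mark (\S+) $mark \s*\(\s* $mark
                   [^)]* $mark \s*\)\s* $mark =\s* $mark
                   (?:0x)?\d+ $mark \s*;\s*\z/x ? 1 : 0;

print "matched: $ok, furthest offset reached: $furthest\n";
```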