zemane has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I want to create a regex object that can be used to match zero or more VHDL comments. A VHDL comment starts with '--' and continues until the end of the line or file.

my $c = qr/(?>--[^\n]*(?:\n|\z))/; # VHDL comment regex
I want to use $c with different quantifiers, like $c, $c?, $c+ and $c*, depending on context.
# position 01234567 89
my $str = "a --b x\n x";
I want the following code to match at positions 2-9:
$str =~ m/$c* x/;
But the following should not match at all:
$str =~ m/$c*x/;
Unfortunately, when the regex fails to match 'x' the engine bumps along to a position inside the comment (which I'd like to skip) and eventually matches at position 6.

Any idea how to fix this?

Re: How do I avoid regex engine bumping along inside an atomic pattern?
by tilly (Archbishop) on Aug 23, 2008 at 04:31 UTC
    If there were no x in the comment, the second one would still match at the final x, because that is an x preceded by zero comments. Could it be that the pattern you want is something like $str =~ m/(^|$c) x/?

    In general you have a more basic problem. Which is that you're trying to use regular expressions for parsing, which they are poorly suited for. Instead you want to use regular expressions for tokenizing, and then move parsing logic into code. The basic trick for that is to use pos and the \G assertion liberally within regular expressions using the /g modifier.
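
    As a minimal sketch of that tokenizing idea (not code from this thread; the token names COMMENT, SPACE and WORD are made up for illustration), each alternative is anchored with \G and uses /gc so that a failed attempt leaves pos where it was:

```perl
use strict;
use warnings;

my $str = "a --b x\n x";
my @tokens;    # hypothetical token list, for illustration only

pos($str) = 0;
while (pos($str) < length($str)) {
    # Every pattern starts with \G, so it can only match exactly at pos().
    # /c keeps pos() unchanged when an attempt fails.
    if    ($str =~ m/\G--[^\n]*(?:\n|\z)/gc) { push @tokens, 'COMMENT' }
    elsif ($str =~ m/\G\s+/gc)               { push @tokens, 'SPACE' }
    elsif ($str =~ m/\G(\w+)/gc)             { push @tokens, "WORD($1)" }
    else  { die "unexpected character at offset " . pos($str) . "\n" }
}

print join(' ', @tokens), "\n";
```

    Because the comment is consumed as one token, the x inside it is never offered to the WORD rule; any parsing logic can then work on the token list instead of on the raw string.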

      Hi. If there were no x in the comment, the second test would still fail because there is no ' x' to match (note the blank space before the x). But I agree with you that I am trying to do too much with regular expressions. I believe I can do the following:
      my $c = qr/(?>\s|--[^\n]*(?:\n|\z))/; # one whitespace or one comment

      # later on, when parsing...
      pos($str) = 0;
      if ($str =~ m/a/gc) { print "found a\n" } else { print "missing a\n" }
      $str =~ m/$c*/gc; # skip any comments, whitespaces
      if ($str =~ m/x/gc) { print "found x\n" } else { print "missing x\n" }
      I am not sure if I need to set pos($str) to 0 at the beginning. And I am not sure if I need to use \G when parsing.

      But again, thanks for your ideas!

        You don't need to set pos($str) to 0 at the beginning - it is automatically undef which does the same thing. However you do need to reset it after every failed match before you try to match again.

        But you do need to use \G or else you get your original problem. Using a \G at the start of your RE says, "Does this match right where I left off?" Leaving it out means, "Search from where I left off to find where it matches." So the latter will search ahead and find matches inside comments. The former can have the logic to know whether it is inside a comment or not. The latter does not.
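
        A small sketch of that difference on the string from the original question (the variable names are illustrative, and the starting offsets are chosen by hand for the demonstration):

```perl
use strict;
use warnings;

my $c   = qr/(?>\s|--[^\n]*(?:\n|\z))/;   # one whitespace or one comment
my $str = "a --b x\n x";

# Without \G the engine searches forward from pos() and finds
# the x inside the comment.
pos($str) = 2;                            # start of the comment
my $unanchored = ($str =~ m/x/gc) ? pos($str) - 1 : -1;

# With \G the same attempt fails, because offset 2 is not an x.
pos($str) = 2;
my $in_place = ($str =~ m/\Gx/gc) ? 1 : 0;

# Skipping whitespace and comments first, then matching with \G,
# lands on the x after the comment.
pos($str) = 1;
$str =~ m/\G$c*/gc;
my $after_skip = pos($str);
my $anchored   = ($str =~ m/\Gx/gc) ? pos($str) - 1 : -1;

print "unanchored search found x at offset $unanchored\n";
print "anchored attempt at offset 2 matched: $in_place\n";
print "after skipping, pos is $after_skip and x is at offset $anchored\n";
```

        The unanchored search reports the x at offset 6, inside the comment, while the \G version only ever sees the x at offset 9.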

        About the second test, I suspect you didn't say exactly what you meant to say in the original question...

Re: How do I avoid regex engine bumping along inside an atomic pattern?
by toolic (Bishop) on Aug 23, 2008 at 02:18 UTC
    I'm not really following your problem, but here are some ideas that may be of use to you.

    Since you are parsing strings with multiple newline characters, perhaps the /m modifier would be of use. Search for //m in perlretut.

    You may have special characters in your regex, so you may need to use the \Q..\E escape sequences.

    Perhaps the CPAN module, Hardware::Vhdl::Lexer, does something similar to what you are looking for. I have not used it, but poking through the source code may give you clues if/how the author parses VHDL comments.

Re: How do I avoid regex engine bumping along inside an atomic pattern?
by jethro (Monsignor) on Aug 23, 2008 at 02:18 UTC

    You could look for non-comments after a line beginning instead of comments before a line ending:

    my $c= qr/(?:\A|\n)([^-]|-[^-])/;
    $str =~ m/$c* x/;

    (Yes, that regex could use some forward lookahead, but I always have to look up the syntax so I'll leave that as an exercise.)

    Generally such constructs are easier to parse if you only look at one line per string.
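
    One way to read the line-at-a-time suggestion, as a rough sketch rather than the poster's own code: split the input on newlines and strip everything from '--' to the end of each line before looking for anything else. It assumes '--' never appears inside a VHDL string literal, which a real lexer would have to handle.

```perl
use strict;
use warnings;

my $str = "a --b x\n x";

my @code;    # each input line with its comment removed
for my $line (split /\n/, $str) {
    # Assumption: '--' always starts a comment (no string literals).
    (my $stripped = $line) =~ s/--.*\z//;
    push @code, $stripped;
}

# Report the 1-based line numbers that still contain an x.
my @hits = grep { $code[$_ - 1] =~ /x/ } 1 .. @code;
print "x outside a comment on line(s): @hits\n";
```

    With the comments gone, the x on the first line is no longer visible to any later pattern, so no special regex is needed to skip it.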

Re: How do I avoid regex engine bumping along inside an atomic pattern?
by AnomalousMonk (Archbishop) on Aug 23, 2008 at 03:36 UTC
    Offhand, I don't see any straightforward way to modify the regex object $c to achieve the result you want. The hard fact is, there is an 'x' in $str at offset 6 that is preceded by zero or more $c.

    I think the simplest approach is to use an additional anchor for the match. Since $c consumes everything up to and including the newline, if you begin the match of whatever follows $c with a start-of-(embedded)-line anchor, you get the desired results for the example string given.

    perl -wMstrict -le
    "my $c = qr{ (?> -- [^\n]* (?:\n|\z)) }xms;
     my $str = qq(a --b x\n x);
     print qq(match at positions $-[1]-), $+[1]-1
       if $str =~ m{ ($c* [ ] x) }xms;
    "
    match at positions 2-9

    perl -wMstrict -le
    "my $c = qr{ (?> -- [^\n]* (?:\n|\z)) }xms;
     my $str = qq(a --b x\n x);
     print qq(match at positions $-[1]-), $+[1]-1
       if $str =~ m{ ($c* x) }xms;
    "
    match at positions 6-6

    perl -wMstrict -le
    "my $c = qr{ (?> -- [^\n]* (?:\n|\z)) }xms;
     my $str = qq(a --b x\n x);
     print qq(match at positions $-[1]-), $+[1]-1
       if $str =~ m{ ($c* ^ [ ] x) }xms;
    "
    match at positions 2-9

    perl -wMstrict -le
    "my $c = qr{ (?> -- [^\n]* (?:\n|\z)) }xms;
     my $str = qq(a --b x\n x);
     print qq(match at positions $-[1]-), $+[1]-1
       if $str =~ m{ ($c* ^ x) }xms;
    "

    (The last example had no output, i.e., no match.)

    BTW - These examples were run under Perl 5.8. The 5.10 regex possessive quantifiers might be worth investigating, but I think they are just shorthand for the (?> ... ) 'atomic' construct and would give the same results.
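
    A quick way to check that guess, as a sketch that assumes Perl 5.10 or later (the possessive form $c*+ below is not taken from the thread):

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers need Perl 5.10 or later

my $c   = qr{ (?> -- [^\n]* (?:\n|\z)) }xms;
my $str = "a --b x\n x";

# Plain quantifier on the atomic comment pattern.
my $atomic = ($str =~ m{ ($c* x) }xms)
    ? $-[1] . '-' . ($+[1] - 1)
    : 'no match';

# Possessive quantifier: $c*+ never gives back a comment it has matched.
my $possessive = ($str =~ m{ ($c*+ x) }xms)
    ? $-[1] . '-' . ($+[1] - 1)
    : 'no match';

say "atomic:     $atomic";
say "possessive: $possessive";
```

    Both forms still match the x at offset 6, because the engine remains free to start a fresh attempt at a later offset after the attempt at the start of the comment fails. Neither construct prevents the bump-along on its own.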