Uruk has asked for the wisdom of the Perl Monks concerning the following question:

Let's say I have this regex: m/foo(.*?)foo/; What I'm looking to express that I don't know how to is this: how do I make it so that $1 may not contain the substring "bar"? I've looked into ^b^a^r and ^bar both of which are wrong because they either require the string to be at least 3 characters long, don't account for different posiitons of bar within the string, or also disallow "rab" and "arb". So I want to express minimal matching of any arbitrary characters disallowing a certain subpattern. How can I do this? I've heard as a suggestion negative lookahead assertions, but I was under the impression that they were only for matching overlapping pieces of text, i.e. if you were matching: "pop" against "popop" you would use a negative lookahead assertion if you wanted more than one match to be returned. (because the first p of the second "pop" overlaps with the last p of the first match) Any suggestions?

Replies are listed 'Best First'.
Re: Regex'ing backreferences
by zejames (Hermit) on Sep 21, 2000 at 13:24 UTC
    I think the problem here is to say that you do NOT want bar in $1. Because if you wanted it, it may be easier :-)

    Something like this should hopefully do want you want:

    /foo(?!.*?bar.*?)(.*?)foo/;
    You first try to find 'foo'. OK. Then you check if the substring between 'foo' and the other 'foo' contains 'bar', but with a zero-width assertion. If that does not contain 'bar', then you can save it into $1.

    Thank you :-) thanks to your question I have understood why zero-witdh assertion is really cool!!

    Hope this helps. Bye

    James

    Update: and what about doing:

    /foo(?!.*?bar.*?foo)(.*?)foo/;
    I think it is better, unless you show me another "worst case" :-))
      You unfortunately will disallow strings with bar *after* the second foo. What you want is this:
      /foo((?:(?!bar).)*)foo/
      Note that (?:stuff) is a non-grouping set of parens, and (?!stuff) is a negative assertion.

      UPDATE
      I noticed a while ago that my RE was incorrect, decided to leave it for someone else to find and mention. But nobody did so I will mention that I forgot a ?:

      /foo((?:(?!bar).)*?)foo/
        While this method works, it is a bit slow, since it checks for the existence of 'bar' EVERY character. Use the "unrolling the loop" technique, as found in "Mastering Regular Expressions":

        Update: tilly's right, mine fails to match on a specific type of string ("foobabfoo"). Fixed below.
        m{ foo ( # save to $1 [^b]* # 0 or more non-b characters (?: b+ # 1 or more b's (?: a(?!r) # an 'a' NOT followed by an 'r' | # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x;
        The general pattern behind unrolling the loop is to get your regex to look something like OK* ( NOTOK OK* )*.

        $_="goto+F.print+chop;\n=yhpaj";F1:eval
      What about "foohellofoobarfoo"? You should be matching and putting "hello" into $1, but your new version won't.
Be obvious (Re: Regex'ing backreferences)
by tye (Sage) on Sep 21, 2000 at 19:41 UTC

    Here is another alternative:

    /foo(.*?)(foo|bar)/ && "bar" ne $2

    One functional difference between this and tilly's and japhy's correct solutions is that it can clobber regex variables when the first half matches but the total expression is false. Another is that you can't embed it inside a larger regex.

    Which leads me to the "obvious" solution:

    /foo(.*?)foo/ && $1 !~ /bar/

    And after benchmarking the results are:

    Benchmark: running japhy, obvious, tilly, and tye, each for at least 3 CPU seconds (sorted fastest to slowest)... obvious: 3.13 CPU @ 77.96/s (n=244) tilly: 3.41 CPU @ 33.43/s (n=114) tye: 3.29 CPU @ 20.06/s (n=66) japhy: 3.02 CPU @ 10.60/s (n=32)

    So the most "optimized" is the slowest and the most obvious is the fastest. I think I know which one I'll use next time. :) Here is the benchmarking code:

    #!/usr/bin/perl -w use strict; use Benchmark; my @pat= qw( fooxfoo fooxbar xfooxfoo xfooxbar fooxfoox fooxbarx xfooxfoox xfooxbarx ); my @fill= ( "ba", "fo", "bf", "bafo", "babrar" ); @pat= map { my $fill= $_ x 1000; map { ( my $pat= $_ ) =~ s/x/$fill/g; + $pat } @pat } @fill; sub obvious { /foo(.*?)foo/ && $1 !~ /bar/ and $1 } sub tye { /foo(.*?)(bar|foo)/ && "bar" ne $2 and $1 } sub tilly { /foo((?:(?!bar).)*)foo/ and $1 } sub japhy { m{ foo ( # save to $1 [^b]* # 0 or more non-b characters (?: (?>b+) # 1 or more b's (NO BACKTRACKING!) (?: a(?!r) # an 'a' NOT followed by an 'r' | # or [^a] # a non-a character ) [^b]* # 0 or more non-b characters )* # that whole b... subset, 0 or more times ) foo # and foo }x and $1 } sub checkthese { my( $sub, $hTests )= @_; my %res; for my $meth ( keys %$hTests ) { my $res= $sub->( $hTests->{$meth} ); for( keys %res ) { if( $res ne $res{$_} ) { warn "$meth and $_ disagree!\n"; } } $res{$meth}= $res; } } checkthese( sub { join " ", map { $_[0]->() } @pat }, { tilly => \&tilly, japhy => \&japhy, tye => \&tye, obvious => \&obvious, } ); timethese( -3, { obvious => sub { map obvious(), @pat }, tye => sub { map tye(), @pat }, tilly => sub { map tilly(), @pat }, japhy => sub { map japhy(), @pat }, } );
            - tye (but my friends call me "Tye")