swackerl has asked for the wisdom of the Perl Monks concerning the following question:

Below is a simplified version of the code that I'm trying to get to work:
my $str = "start b2 end start b2 b2 end start b2 end";
if ($str =~ /start(?!.*start.*)(.*?b2.*?b2.*?end)/) {
    print "Regexp matched!";
}
To explain: I'm trying to match a substring of $str made up of a 'start' marker, followed by two 'b2' markers, followed by an 'end' marker, with any text (not just whitespace) interspersed between those markers. Think of 'start' as marking the start and 'end' as marking the end of a candidate string. I DON'T want the match to span two strings, as in "start b2 end start b2 b2 end", where the matched string contains an additional 'start' marker.

I've tried doing this with look-ahead and look-behind assertions. The problem with the lookahead assertion is that it searches to the end of the string, so if there is any 'start' AFTER the string that I want to match, no match is found. The problem with a lookbehind assertion is that it does not allow variable-length patterns. Can anyone help me solve this difficult problem?

Re: Keeping lookahead assertion from looking to the end of the string?
by VSarkiss (Monsignor) on Sep 05, 2002 at 03:34 UTC

    Hm, this will be very difficult to do in a single regexp. The non-overlapping requirement is similar to the problem of Matching C-style comments. Plus all those .*? are going to make the regex engine go nuts.

    A real-live parser would probably be much easier for this. Take a look at Parse::RecDescent, or Parse::Yapp if you're already familiar with YACC/Bison.
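
For comparison, here is a minimal single-regex sketch that borrows the trick usually applied to C-style comments: step through the text one character at a time and refuse to step onto a marker, so the lookahead only ever inspects the next few characters instead of the rest of the string. The helper name `first_block` is hypothetical, and the sketch assumes that "two 'b2' markers" means at least two and that the markers never appear inside other words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Returns the text between the
# first 'start' and 'end' that encloses at least two 'b2' markers with no
# other 'start' or 'end' in between, or undef if there is no such block.
sub first_block {
    my ($str) = @_;
    # Consume one character at a time, never stepping onto either marker,
    # so the negative lookahead stays local to the current position.
    my $gap = qr/(?:(?!start|end).)*?/s;
    return $str =~ /start($gap b2 $gap b2 $gap)end/x ? $1 : undef;
}

my $block = first_block("start b2 end start b2 b2 end start b2 end");
print defined $block ? "matched [$block]\n" : "no match\n";
# prints: matched [ b2 b2 ]
```

If the markers can occur inside longer words (for example 'end' in 'send'), the pattern would need word boundaries or stricter delimiters, which this sketch does not attempt.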

Re: Keeping lookahead assertion from looking to the end of the string?
by Limbic~Region (Chancellor) on Sep 05, 2002 at 04:58 UTC
    I am not sure whether you are making this harder than it is. If you always have an end marker following a start marker, then it is trivial. If, on the other hand, you may see something like:

    "start b2 start b2 end start b2 b2 end"

    then the only way I see to do it would be to use functions like index and substr to iterate over the string removing pieces of it until it is in the form you want before regex'ing it.
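
The index/substr idea could be sketched roughly as follows. This is one possible reading of it, not necessarily what the poster had in mind: for each 'end', look back with rindex for the nearest 'start' and keep only the text between them. The sub names `innermost_blocks` and `blocks_with_two_b2` are hypothetical, and "two 'b2' markers" is assumed to mean at least two.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the text of every start..end block that has
# no other marker inside it, using only index, rindex and substr.
sub innermost_blocks {
    my ($str) = @_;
    my @blocks;
    my $from = 0;    # everything before this offset has been consumed
    while ((my $end = index($str, 'end', $from)) >= 0) {
        # Nearest 'start' to the left of this 'end'.
        my $start = rindex($str, 'start', $end);
        if ($start >= $from) {
            my $open = $start + length 'start';
            push @blocks, substr($str, $open, $end - $open);
        }
        $from = $end + length 'end';
    }
    return @blocks;
}

# Keep only the blocks that hold at least two 'b2' markers.
sub blocks_with_two_b2 {
    my ($str) = @_;
    return grep { (() = /b2/g) >= 2 } innermost_blocks($str);
}

print "[$_]\n" for blocks_with_two_b2("start b2 start b2 end start b2 b2 end");
# prints: [ b2 b2 ]
```

Once the blocks are isolated this way, any further checks can use simple regexes that no longer need to worry about crossing a marker.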

    If on the slight chance you will always see a start marker, some text that does NOT include another start marker, and then an end marker, you can simply do this:

    my $str = "start b2 end start b2 b2 end start b2 end";
    if ($str =~ /.*start((.*?b2.*?b2).*?)end/) {
        print "$1 is between \"start\" and \"end\"\n";
    }
      Limbic~Region, thanks. I think the solution you provided will work best. I don't know why I didn't think of putting ".*" at the beginning of the regex to make it match minimally over "start .. end". Thanks for all of the help!
        After taking another look, I think that the original reason why I wanted to use lookahead expressions was that the "start ... end" strings may be nested, as in the example provided by Limbic~Region. It looks like I'll need to use looping and string manipulation in place of a single regular expression. *sigh*
      Doesn't always work:

      my $str = "start b2 end start b2 end start b2 end"; # only one b2 between each start and end
      if ($str =~ /.*start((.*?b2.*?b2).*?)end/) {
          print "$1\n";
      }
      __END__
      Output:
      b2 end start b2
      This shouldn't match when there is only one 'b2' between start and end, but it does.

      IMO, the most robust solution is to use a parser (Parse::RecDescent) or multiple regexes.
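
A minimal sketch of the multiple-regex route, under the assumption that "two 'b2' markers" means at least two: split the string on 'start' so that no piece can contain a second 'start', cut each piece at its first 'end', and then count the 'b2' markers separately. The sub name `blocks_by_split` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of every start..end block that
# contains at least two 'b2' markers, using several small regexes.
sub blocks_by_split {
    my ($str) = @_;
    # The first field is whatever precedes the first 'start', so drop it.
    my (undef, @chunks) = split /start/, $str, -1;
    my @hits;
    for my $chunk (@chunks) {
        # Each chunk follows a 'start' and cannot contain another one.
        next unless $chunk =~ /^(.*?)end/s;
        my $body  = $1;
        my $count = () = $body =~ /b2/g;    # number of 'b2' markers
        push @hits, $body if $count >= 2;
    }
    return @hits;
}

print "[$_]\n" for blocks_by_split("start b2 end start b2 b2 end start b2 end");
# prints: [ b2 b2 ]
```

With the single-'b2' test string from the post above, this returns an empty list, because no piece is allowed to reach across an 'end' into the next block.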

Re: Keeping lookahead assertion from looking to the end of the string?
by Django (Pilgrim) on Sep 05, 2002 at 07:23 UTC

    If you just want to check whether the string matches, Limbic~Region's example will probably do it.

    Although you didn't say it, I suppose you want to catch the substrings between 'start' and 'end' - otherwise you don't have to care about overlapping. In that case you could do something like the following:

    #!/usr/bin/perl
    use warnings;

    $_ = "start b2 end start b2 b2 end start b2 end";
    @Matches = /            # list context to catch all
        (?<= \b start \b )  # 'start' must precede
        ( .*? )             # catch fewest possible anythings
        (?= \b end \b )     # 'end' must follow
    /gx                     # global and expressive
    and print join("\n", @Matches);
    __DATA__
    b2
    b2 b2
    b2

    update: normal matching without look-ahead and look-behind will have the same effect with the example string:

    @Matches = / \b start \b ( .*? ) \b end \b /gx

    update: I overlooked that Limbic~Region's regex catches the substring too. Shame on me!

    ~Django
    "Why don't we ever challenge the spherical earth theory?"

      Um, my RegEx does catch the substring between "start" and "end", as shown by the print statement. As always, TIMTOWTDI.