toonski has asked for the wisdom of the Perl Monks concerning the following question:

Say I want to match anything between two occurrences of the same word:
"The quick brown fox jumped over the other quick brown fox"

it would get the following matches:
"brown fox jumped over the other"
"fox jumped over the other quick"
"jumped over the other quick brown"

I'm trying, but I don't know the metacharacters well enough. I would like, however, to do this in one regex if possible. Any suggestions?

Re: Regex help
by BrowserUk (Patriarch) on Jan 29, 2004 at 23:07 UTC

    I don't think there is any direct way of doing this with a regex as there is no way to do a capture of the encompassed string without advancing the position.

    Thus, you have to reset the position (pos) to the end of the first bracketing word ($+[1]) after each match.

    $s = 'The quick brown fox jumped over the other quick brown fox';
    print $2 and pos($s) = $+[1] while $s =~ m[\b(\S+)\b(.+?)\b\1\b]g;

    brown fox jumped over the other
    fox jumped over the other quick
    jumped over the other quick brown

    This includes the whitespace at either end of the encompassed text. To avoid that, change the regex to m[\b(\S+)\b\s+(.+?)\s+\b\1\b]g.
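
For anyone who wants to run the pos-reset idea above as a script, here is a minimal sketch using the whitespace-trimming variant of the regex. The helper name between_repeats is made up for illustration, and the matches are collected in an array rather than printed inside the loop.

```perl
use strict;
use warnings;

# Sketch only: between_repeats is a hypothetical helper name.
sub between_repeats {
    my ($s) = @_;
    my @found;
    while ( $s =~ m[\b(\S+)\b\s+(.+?)\s+\b\1\b]g ) {
        push @found, $2;
        # Rewind to the end of the first bracketing word ($+[1]) so that
        # the next iteration can find a match overlapping this one.
        pos($s) = $+[1];
    }
    return @found;
}

print "$_\n" for between_repeats(
    'The quick brown fox jumped over the other quick brown fox');
```

Each match starts with at least one non-space character, so pos always moves forward and the loop terminates.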


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!

Re: Regex help
by Zaxo (Archbishop) on Jan 29, 2004 at 23:11 UTC

    You want capture and lookahead:

    print $2, $/ while /(\b\w+\b)(?=(.*?)\1)/g;

    \1 there is the form of backreference needed in regexen.

    After Compline,
    Zaxo

      Very cool, but the output of print $2, $/ while /(\b\w+\b)(?=(.*?)\1)/g; is:

      brown fox jumped over the other
      fox jumped over the other quick
      jumped over the other quick brown
      o #????

      In order to pass over the "o" (from "the o_the_r"), the backreference in the lookahead needs to be anchored to word boundaries, i.e.:

      print $2, $/ while /(\b\w+\b)(?=(.*?)\b\1\b)/g;
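
A self-contained sketch of the anchored lookahead version, for comparison. The helper name lookahead_between is hypothetical. Because the lookahead does not consume the text it captures, no pos reset is needed.

```perl
use strict;
use warnings;

# Sketch only: lookahead_between is a hypothetical helper name.
sub lookahead_between {
    my ($s) = @_;
    my @found;
    # The lookahead captures the text up to the next whole-word repeat
    # without consuming it, so later words can still start their own match.
    while ( $s =~ /\b(\w+)\b(?=(.*?)\b\1\b)/g ) {
        push @found, $2;
    }
    return @found;
}

# Brackets make the surrounding whitespace in each capture visible.
print "[$_]\n" for lookahead_between(
    'The quick brown fox jumped over the other quick brown fox');
```

The captured text keeps the spaces on either side. For the string "the one the two the", this sketch returns " one " and " two "; whether that is the desired output is a separate question.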

      Not to mention, what is the desired output if the string is "the one the two the"?

      I can't believe you missed out on this opportunity to wave (--) the magic wand variable ($|) to get only the odd or even elements of a list:
      print grep --$|,($|||=1)&& /\b(\w+)\b(?=(.*?)\b\1\b)/g
        What the good lord does that code do? Or rather, how does it work? What's the $|||=1 bit, etc?
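
A hedged sketch of what the one-liner appears to be doing, unrolled into steps. $|||=1 parses as $| ||= 1, which sets the autoflush variable $| to 1 if it is not already set. In list context the /g match returns all captures as ($1, $2, $1, $2, ...), and since $| can only hold 0 or 1, each --$| flips it, so grep keeps every second element. The variable names below are made up for illustration.

```perl
use strict;
use warnings;

my $s = 'The quick brown fox jumped over the other quick brown fox';

# In list context, m//g returns every capture of every match:
# ($1, $2, $1, $2, ...).
my @captures = $s =~ /\b(\w+)\b(?=(.*?)\b\1\b)/g;

# Plain-index equivalent: keep the elements at odd indices,
# which are the $2 captures.
my @between = @captures[ grep { $_ % 2 } 0 .. $#captures ];

# The toggle itself: starting from 1, successive --$| evaluations
# yield 0, 1, 0, 1, ... so grep drops each $1 and keeps each $2.
local $| = 1;
my @toggled = grep { --$| } @captures;

print "[$_]\n" for @toggled;
```

Both @between and @toggled end up holding the same three strings; the index version is easier to read, the $| version is shorter.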