cdherold has asked for the wisdom of the Perl Monks concerning the following question:

Dearest Monks,

I am trying to set up what I thought would be a basic regex to pull out some text, but I can't seem to get the right code jotted down.

I am trying to match everything between "a" and the third instance of "c" in the following string: "abcbcbc" (where "b" is variable). Output should be "bcbcb". But I can't seem to find a way to set the closing parenthetical of the text capture at a point after the first instance of the closing pattern.

$body = "abcbcbc";
$body =~ /a(.*?)(third instance of c)/;

How should I specify the third instance of "c"?

Thanks Monks!

Chris

Replies are listed 'Best First'.
Re: Regex text extraction b/w first intance of pattern X and third instance of pattern Y.
by BrowserUk (Patriarch) on Mar 31, 2010 at 05:22 UTC

    The "third instance of c" really means "everything not a c plus a c, plus everything not a c plus a c, plus everything not a c up to but excluding the next c":

    "abcbcbc" =~ m[ a ( [^c]* c [^c]* c [^c]* ) c ]x and print "'$1'";; 'bcbcb' ## or "abcbcbc" =~ m[a((?:[^c]*c){2}[^c]*)c] and print "'$1'";; 'bcbcb'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Regex text extraction b/w first intance of pattern X and third instance of pattern Y.
by ikegami (Patriarch) on Mar 31, 2010 at 05:14 UTC

    You'll have better luck with regex if you think in terms of what you do want to match rather than what you don't want to match or what's around what you want to match.

    my ($match) = $body =~ /(?<=a)([^c]*c[^c]*c[^c]*)(?=c)/;
    The above can also be written as
    my ($match) = $body =~ /(?<=a)([^c]*(?:c[^c]*){2})(?=c)/;
    Or maybe you want
    my ($match) = $body =~ /(?<=a)([^ac]*(?:c[^ac]*){2})(?=c)/;
Re: Regex text extraction b/w first intance of pattern X and third instance of pattern Y. (ass u me)
by tye (Sage) on Mar 31, 2010 at 13:57 UTC

    Given that "b" is variable, I won't assume that "c" is just a fixed string rather than a regex pattern. For my example, I'll use your use of the term "body" to make a wild guess.

    my $a= qr{<table[^>]*>};
    my $c= qr{<tr[^>]*>};
    my $d= qr{</table>};
    for( $body ) {
        die "No $a" if ! /$a/g;
        my $start= pos();
        die "No 1st $c" if ! /$c/g;
        die "No 2nd $c" if ! /$c/g;
        die "No 3rd $c" if ! /(?=$c)/g;
        my $end= pos();
        pos()= $start;
        die "No 3rd $c before $d" if /$d/g && pos() < $end;
        return substr( $_, $start, $end-$start );
    }

    Also, the first two responses don't enforce "first instance of $a" and so may cause problems if the requirements have the temerity to evolve (oh, I also pre-evolved them for you in my example). ;)

    - tye        

    P.S. My real (wild) guess is that you are parsing e-mail bodies.
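
    The remark above about not enforcing the "first instance of $a" can be seen with a small experiment. This is only a sketch: the input string and the anchored variant using \A[^a]*a are assumptions added for illustration, not something posted in the thread. With two leading 'a's, the unanchored pattern quietly starts after the second 'a', while the anchored one refuses to match.

```perl
use strict;
use warnings;

# Hypothetical input: two 'a's before the run of 'c's.
my $body = "aabcbcbc";

# One of the earlier patterns: when the attempt after the first 'a' fails,
# the regex engine simply retries after the next 'a'.
my ($loose) = $body =~ /(?<=a)([^ac]*(?:c[^ac]*){2})(?=c)/;

# Anchored variant (an assumption, not from the thread): \A[^a]*a pins the
# match to the first 'a', so a failure there means no match at all.
my ($strict) = $body =~ /\A[^a]*a([^ac]*(?:c[^ac]*){2})(?=c)/;

print defined $loose  ? "loose:  '$loose'\n"  : "loose:  no match\n";
print defined $strict ? "strict: '$strict'\n" : "strict: no match\n";
```

    Whether the silent retry is acceptable depends on what the real 'a' and 'c' turn out to be.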

Re: Regex text extraction b/w first intance of pattern X and third instance of pattern Y.
by JavaFan (Canon) on Mar 31, 2010 at 10:43 UTC
    Unlike the two previous posters, I won't make the mistake of assuming you are using 'a', 'b' and 'c' here in any role other than as placeholders. In particular, I presume 'c' here to be some string, which can even be longer than a single character.

    Sometimes it's easier to do things without a complicated regexp. Why not use index first to find the starting position of the third appearance of 'c'?

    my $index2 = -1;
    foreach (1 .. 3) {
        $index2 = index($body, "c", $index2 + 1);
        die "No third 'c'" if $index2 < 0;
    }
    Then use pos() to find where 'a' finished matching (or use index() as well):
    $body =~ /a/g or die "No first 'a'";
    my $index1 = pos($body);
    # my $index1 = index($body, 'a') + length('a');
    Then grab what's in between using the offsets:
    my $inbetween = substr($body, $index1, $index2 - $index1);
    A few things are left as an exercise to the reader:
    • What to do if the third occurrence of 'c' appears before the first 'a'.
    • Use a match instead of index if 'c' is a pattern instead of a string.
    • The above code counts overlapping occurrences. It's easy to adapt if you don't want to count overlaps.
    • The code above assumes you want the third occurrence of 'c' from the start of the string. It's easy to adapt to get the third occurrence of 'c' following the 'a'.
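
    One possible way to work through some of the exercises listed above, as a minimal sketch only. The helper name between_nth and its interface are hypothetical. It assumes 'a' and 'c' are plain strings (not patterns), counts only occurrences of 'c' that follow the first 'a', and skips overlapping occurrences.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are not from the thread).
# Returns the text between the first $start and the $n-th non-overlapping
# $end that follows it, or undef if either cannot be found.
sub between_nth {
    my ($body, $start, $end, $n) = @_;

    my $from = index($body, $start);
    return undef if $from < 0;
    $from += length $start;            # first character after $start

    my $search = $from;                # count $end only after $start
    my $to     = -1;
    for (1 .. $n) {
        $to = index($body, $end, $search);
        return undef if $to < 0;
        $search = $to + length $end;   # step past the whole hit: no overlaps
    }
    return substr($body, $from, $to - $from);
}

print between_nth("abcbcbc",    "a", "c", 3), "\n";   # bcbcb
print between_nth("cccabcbcbc", "a", "c", 3), "\n";   # bcbcb
```

    Because the search for 'c' starts after 'a', the case where the third 'c' precedes the first 'a' cannot arise here. Handling 'c' as a pattern would still need a match in place of index.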
      I won't make the mistake of assuming you are using 'a', 'b' and 'c' here

      Why is it a mistake to assume that he is doing what he says he is doing, and what his example shows he is doing?

      I presume 'c' here to be some string, which can even be longer than a single character.

      And why is it better for you to presume that he means something other than what he says?

      I think that your re-interpretation of the question asked is a valid and interesting adjunct to it, but why the baseless snide narrative?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        BrowserUK:

        I think the "expanded" problem is reasonable given the title, since the title specifies patterns X and Y rather than specific characters. The way I read the OP is that (s)he properly simplified it to a trivial case in the example.

        ...roboticus

      If it's not just c, my solution still works with the equivalent of [^c] for arbitrary patterns. But since I don't know what pattern that might be, I stuck to what the OP said rather than making stuff up.
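
      One common way to write an "equivalent of [^c]" for an arbitrary pattern is a negative lookahead applied one character at a time. The sketch below is an illustration under assumptions: the patterns <start> and <sep> and the sample string are made up, and the lookbehind from the earlier reply is replaced by a plain match, since 'a' may no longer have a fixed length.

```perl
use strict;
use warnings;

# Hypothetical patterns standing in for 'a' and 'c'; not from the thread.
my $a_re = qr/<start>/;
my $c_re = qr/<sep>/;

# (?:(?!$c_re).)* consumes one character at a time, but only at positions
# where $c_re does not begin, so it plays the role that [^c]* plays for a
# single character.
my $not_c = qr/(?:(?!$c_re).)*/s;

my $body = "<start>one<sep>two<sep>three<sep>four";
my ($match) = $body =~ /$a_re($not_c(?:$c_re$not_c){2})(?=$c_re)/;

print "'$match'\n" if defined $match;   # 'one<sep>two<sep>three'
```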
Re: Regex text extraction b/w first intance of pattern X and third instance of pattern Y.
by repellent (Priest) on Apr 01, 2010 at 04:56 UTC
    sub match_in_between {
        my ($str, $r1, $r2, $n) = @_;
        return undef unless 4 == grep { defined() } ($str, $r1, $r2, $n);
        my ($match, $end) = ($str =~ /$r1((?:.*?($r2)){$n})/);
        return "" unless defined($match);
        return substr($match, 0, rindex($match, $end));
    }

    print match_in_between("abcbcbc", qr/a/, qr/c/, 3);     # bcbcb
    print match_in_between("cccabcbcbc", qr/a/, qr/c/, 3);  # bcbcb
    print match_in_between("abcbxbcff", qr/a/, qr/b./, 3);  # bcbx