comment on

Heh. That was tricky to get to work right. But let me save you some reading. First, [^"]* doesn't suppress backtracking. It's just matches zero or more of anything that isn't a quote. [^...] is a negation class; the carat means don't match any of the characters between the square brackets.

Now let me explain zero-width lookaheads so you know what that code is about. When you do a match against $string, Perl keeps track of the offset from the start of the string, which you can get (or set, actually) with pos($string). It makes more sense when you are doing multiple matches against the same string. Let's say I want to collect all the peppers in the following:

  $s = "I'm a pepper, he's a pepper, you're a pepper, she's a pepper..
+.";
  while( $s =~ /(pepper)/g ) {
    push @peppers, $1;
  }
[download]

This will put 4 peppers in the array @peppers. As you probably know, the /g modifer tells the match to remember the position after the last match, and the next time through the loop, start looking for another match from that point. So...

  I'm a pepper, he's a pepper, you're a pepper, she's a pepper...
  ^           ^              ^                ^               ^
  0           1              2                3               4
[download]

...the first time through the loop, pos($s) is 0. After matching the first time, the offset is 12, at position 1. After the next match, the offset is at position 2, and so on until there are no more matches.

A zero-width lookahead means a) do the match and b) if successful, put the offset back where it was before you started. So, in this example...

  $s = "Wouldn't you like to be a pepper too?";
  #     ^                               ^
  #     0                               1
  $s =~ /(?=pepper).+like/;
[download]

...the regex starts searching from the start of the line by default and matches pepper. But it doesn't change the offset to position 1. Instead, it leaves it at position 0, where it searches forward to find 'like'. $s =~ /pepper.+like/ would have failed because after matching pepper, the offset would be at position 1, and searching forward won't find 'like'. The code is the equivalent of $s =~ /pepper.+like|like.+pepper/. It's more useful when parsing complex phrases, like language, where a verb, for example, can be followed by more than one type of word or phrase.

But getting back to your post:

print "Spam!!!\n" if
   $text =~ /^To:
              \s*
                "
                  (?!.+Furry Marmot)
                  [^"]*
                " 
              <marmot\@furrytorium\.com>
            /mx;
[download]

A negative lookahead is like the positive lookahead, above, but succeeds when the search term is not found. In the middle of the regex above, I want the match to succeed if there are two quotes, but they do not contain 'Furry Marmot'. It won't work if I try to match "(?!.+Furry Marmot)" because that says a) find a double-quote, b) don't find 'Furry Marmot' and then leave the offset just after the quote, and c) find the closing quote. This can only match "".

Instead, once we have determined that Furry Marmot is not after the first quote, match zero or more of anything that isn't another quote, up to the closing quote. Now we can check what's in the email address.

This is just a simplistic example, and probably would be overkill if you tried to accomodate multiple addressees, a CC: or BCC: line, etc., but I hope it helps you learn regexes. They are one of my favorite parts of Perl. :-) There's a very good description of backtracking in perldoc perlretut. That and perldoc perlre will shed a lot of light on this.

--marmot

In reply to Re^5: Looking for ideas on how to optimize this specialized grep by furry_marmot
in thread Looking for ideas on how to optimize this specialized grep by afresh1

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.