You're confusing a couple of things. A negation class just means match something that is not this class of chars. [^abc]+ means match one or more of anything that isn't a, b, or c. It has nothing to do with backtracking.

Read perlretut for sure, and see "Backtracking" in perlre. Generally it's not something you have to worry about unless you have a regex that's running really slowly.

With regard to the match above, it's coded to look at a specific pattern of email -- and it's not all-inclusive -- it just determines whether it matches a pattern that *I* say is or is not spam.
1 To: "Furry Marmot" <marmot@furrytorium.com> 2 To: "Mr.Furry Marmot" <marmot@furrytorium.com> 3 To: "pharmacy" <marmot@furrytorium.com> 4 To: <marmot@furrytorium.com> 5 To: <test@furrytorium.com>

According to my admittedly arbitrary rules, if the display name ("Furry Marmot") is consistent with the local-part of the address (the part before the @ sign: "marmot"), this is a valid address. Also, just the email address without the display name is fine.

But the regex I wrote tests for something that doesn't match that pattern. It says, IF there is something between quotes, BUT that something doesn't include "Furry Marmot", AND the address is "<marmot@furrytorium.com>", THEN it's spam. So the regex matches my definition of spam, failing on not-spam.

So number 1 fails because the regex tests for 'not .+Furry Marmot' between quotes but finds 'Furry Marmot' followed by '<marmot@furrytorium.com>'. The match fails, so it is not spam.

Number 2 also fails because we're testing for 'not .+Furry Marmot' and 'Mr.Furry Marmot' actually is '.+Furry Marmot'. But 'pharmacy' is definitely 'not Furry Marmot'; it's followed by the marmot email address, so Number 3 is spam.

Number 4 and 5 fail because the regex is looking for To: "something between quotes".... There are no quotes at all, so it fails quickly, and failure equals not-spam, so they're both not spam.

Obviously one could come up with a much better regex than my off-the-cuff, narcisstic example. :-) I was thinking of a spam catcher I worked on a long time ago, that included a series of patterns to try matching against a header block. One of the patterns I remember is emails addressed to "Online Pharmacy", but with my address. Another pattern was emails from me, to me, which wouldn't happen with those particular email accounts. But you get into all kinds of issues like, "Would I send an email to myself?" For a lot of people, the answer might be yes. And what about something sent to "Subscriber" <marmot@furrytorium.com>? Is that valid? And what do you do with "Funky Marmot" <marmot@furrytorium.com>? Oooooo, Spam Assassin starts looking good very quickly.


In reply to Re^7: Looking for ideas on how to optimize this specialized grep by furry_marmot
in thread Looking for ideas on how to optimize this specialized grep by afresh1

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.