The most common category of regex questions I see is about greediness and non-greediness, and I think many of you will agree: it's a potentially tricky concept for beginners to understand, and it can bite even seasoned regexers from time to time, present company included. A good number of the questions in this category have to do with matching at the end of a string. Here's an example:
n00b
I have the regex /#(.*?)$/ and the string "this #is an #example string". I want to match "example string" with the regex, but for some reason it's matching "is an #example string" even though I've got a non-greedy .* and I've got it anchored to the end of the string.
Invariably, you'll hear answers like "regexes are left-most longest" and "you want [^#]* instead of .*?" and "start your regex with .*". But I don't think I've seen an answer (at least, none has caught my eye the way I'd expect it to) that says "you aren't anchoring your regex to the end of the string".
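
To make the question and the usual answers concrete, here is a minimal sketch that runs the original regex and the two suggested fixes against n00b's string. The variable names ($naive, $negated, $leading) are mine, purely for illustration.

```perl
use strict;
use warnings;

my $str = "this #is an #example string";

# The regex from the question: the engine starts at the FIRST '#',
# and the non-greedy .*? simply keeps growing until $ can succeed.
my ($naive) = $str =~ /#(.*?)$/;

# Fix 1: forbid '#' inside the capture, so an attempt that starts at
# an earlier '#' can never reach the end of the string.
my ($negated) = $str =~ /#([^#]*)$/;

# Fix 2: a greedy leading .* eats as much as it can, so the '#' that
# finally matches is the last one in the string.
my ($leading) = $str =~ /.*#(.*?)$/;

print "naive:   '$naive'\n";
print "negated: '$negated'\n";
print "leading: '$leading'\n";
```

Both fixes work by changing which '#' the match can start from; neither makes $ any more of an "anchor" than it was.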

Thus, this meditation. We often refer to \A \b \B \G \z \Z ^ $ as "anchors", but I feel this is a misnomer, a simplification of their real name (which would be something like "string location assertion") that gets in the way of their actual purpose. One could say that the regex /f$/ is anchored to the end of the string it matches, and one would be essentially right, since such a simple regex triggers a couple of optimizations in the regex engine that result in the regex really meaning "look for an 'f' at the end of the string" rather than "find an 'f' followed by END-OF-STRING". Clearly, in a string with many f's, the optimized version is much smarter and faster.
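
As a small illustration of the "assertion" reading, the sketch below checks where $, \Z and \z are satisfied on a string with a trailing newline, and shows that a bare assertion matches a zero-length span. The sample strings and variable names are assumptions of mine, not anything from the question above.

```perl
use strict;
use warnings;

my $str = "foo\n";

# Each of these tests a location in the string and consumes nothing.
my $dollar = $str =~ /foo$/  ? 1 : 0;  # $  : at the end, or before a final newline
my $big_z  = $str =~ /foo\Z/ ? 1 : 0;  # \Z : same locations as $ (when /m is off)
my $z      = $str =~ /foo\z/ ? 1 : 0;  # \z : only at the very end of the string

# A bare assertion succeeds with a match of width zero.
my $width;
if ("abc" =~ /\z/) {
    $width = $+[0] - $-[0];   # end offset minus start offset of the match
}

print "\$ : $dollar\n";
print "\\Z: $big_z\n";
print "\\z: $z\n";
print "width of /\\z/ match: $width\n";
```

Nothing here pins the rest of the pattern to a position; each assertion only says yes or no when the engine happens to arrive at it.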

But the regex /\s+$/ is not anchored (and this grieves me). I've used this example time and time again when explaining regex reversal and that whole tangent, and it's useful again. If the regex were really anchored (that is, immovable), it would find the end of the string, and then match the regex in that context -- that is, the regex would have to terminate at the end of the string. As it stands, Perl can't optimize the regex that way, and we end up matching EVERY chunk of whitespace, and then testing to see if END-OF-STRING comes after it.
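
One rough way to watch this happen is to drop a (?{ ... }) code block between \s+ and $ and count how often the engine gets that far. This is only a sketch: the counter $tests and the sample string are my own, the exact count may vary with the perl version and its optimizations, and the code block itself may change what the optimizer does. The second half shows the reversal idea in its simplest form, using reverse() and \A as a stand-in for a truly anchored match.

```perl
use strict;
use warnings;

# Three runs of whitespace; only the last one touches the end of the string.
my $str = "a b  c   ";

# Count how many times the engine gets past \s+ and has to test $.
# A package variable keeps the code block simple on older perls.
our $tests = 0;
my $matched = $str =~ /\s+(?{ $tests++ })$/ ? 1 : 0;

print "matched: $matched, end-of-string tests: $tests\n";

# Reversal sketch: flip the string so its end becomes its start,
# where \A restricts the engine to a single starting position.
my $rev = reverse $str;        # scalar context reverses the characters
$rev =~ s/\A\s+//;             # strip what used to be trailing whitespace
my $trimmed = reverse $rev;

print "trimmed: '$trimmed'\n";
```

If $ really were an anchor, the counter would not need to climb above one for this string.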

So I think "anchor" is incorrect. But "string location assertion", while accurate, is clunky. So where do we go from here? I'm not sure. This is mainly me venting a frustration. Ideally, in Perl 6, we'll have real anchors and regexes that do what we mean.

Hi ho anchor, aweigh!


Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

In reply to The "anchor" misnomer in regexes by japhy
