Dumu has asked for the wisdom of the Perl Monks concerning the following question:

In his book 'Mastering Perl', brian d foy says:
Perl's feature that we call "regular expressions" really aren't; we've known this ever since Perl allowed backreferences (\1 and so on).
What is it about backreferences that makes Perl regexen irregular?
  • Comment on Why aren't perl regular expressions really regular expressions?

Replies are listed 'Best First'.
Re: Why aren't perl regular expressions really regular expressions?
by Corion (Patriarch) on Mar 17, 2015 at 10:30 UTC

    The term "Regular Expression" originally comes from computer science. The set of Regular Languages is more or less equivalent to the set of Regular Expressions, but it is smaller than the set of languages accepted by Perl regular expressions. As the documentation already says, for example backreferences cannot be expressed in a Regular Expression and are not available in the set of Regular Languages.

    Update

    Any Regular Expression has a finite number of states. Arbitrary backreferences cannot be expressed in a fixed automaton with a finite number of states. This means that backreferences are not available in Regular Expressions.

Re: Why aren't perl regular expressions really regular expressions?
by LanX (Saint) on Mar 17, 2015 at 10:22 UTC
    Simplified: real regular expressions have a narrowly defined semantic and are implemented as a state machine while perl emulates this with kind of op-codes for higher flexibility and richer syntax (like mentioned \1 back reference or embedded perl code)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)

    PS: Je suis Charlie!

    update

    For clarification:

    \1 (mainly match part) is not to be confused with $1 (only replace part) of s/match/replace/

Re: Why aren't perl regular expressions really regular expressions?
by BrowserUk (Patriarch) on Mar 17, 2015 at 11:03 UTC

    They've answered the what, here's the why:

    The 'regular languages' that could be defined in terms of those original 'regular expressions' were so limited as to be practically useless beyond the kind of pointless exams questions so beloved of academia.

    They had to made irregular, to become useful. And despite that the purists hate the irregularity, they've not come up with anything better.

    A bit like the Perl language itself; much of its power derives directly from its irregularity.

    Rant:


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Why aren't perl regular expressions really regular expressions?
by AppleFritter (Vicar) on Mar 17, 2015 at 13:21 UTC

    To expound on the answers you've already received -- the word "regular" in "regular expression" comes from regular languages, a certain kind of formal language (as described by formal grammars).

    There are a number of different kinds of these languages that arise fairly naturally; the most important ones are collected in the so-called Chomsky hierarchy. They're characterized by increasingly strict conditions imposed on their grammars' production rules; the most general kind of language (recursively enumerable) has no restrictions, context-sensitive and context-free languages have some, regular languages have the most stringent. (The most stringent in this hierarchy, that is: there are languages that would rank below regular languages still, e.g. star-free languages.)

    Interestingly, all these languages are computed by certain kinds of formal machines, from Turing machines (or equivalent notions of computation, e.g. μ-recursive functions) for recursively enumerable languages to finite automata (deterministic or non-deterministic, it makes no difference) for regular languages.

    Regular expressions, in the theoretical sense (rather than Perl's), directly describe finite automata. The only things you need there are concenation, alternation ((...|...) in Perl) and the Kleene star ((...)* in Perl).

    This is what Perl's regular expressions are based on, but in practice, when you're more interested in solving problems than researching formal languages (as fascinating a subject as they are), you'll need more tools in your toolbox, and Perl gained a lot of tools to make your job easier, from references to look-around assertions to non-backtracking subpatterns to pattern interpolation to match-time code evaluation to... well, you get the idea.

    That's the sense in which Perl's regular expression are, in fact, irregular -- or, perhaps equivalently, "Perl-compatible". They go far beyond what "regular expressions" originally were, and as a result they're vastly more useful in practice.

    (On that note, BTW, it's also worth adding that when you encounter "Perl-compatible" regular expressions in any other language, it's pretty much a safe bet that they aren't in fact.)

      > BTW, it's also worth adding that when you encounter "Perl-compatible" regular expressions in any other language, it's pretty much a safe bet that they aren't in fact.

      AFAIK are tools and languages claiming "Perl-compatible" regular expression based on the same same PCRE library and are hence compatible to each other.

      The problem is rather that PCRE isn't fully compatible to Perl, they rather try to mimic the incompatibilities in syntax to POSIX RegExes Gnu Basic RegExes and some added features Larry included into Perl.

      I.o.W the syntax looks compatible but the results may not in edge cases.

      edit

      recommended read Friedl's "Mastering Regular Expression" from O'Reilly.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)

      PS: Je suis Charlie!

        I.o.W the syntax looks compatible but the results may not in edge cases.

        Exactly. The basic features are probably in place - I say "probably", as I've not actually looked at the pcre library -, but not all of the advanced features will be. At the very least I would expect that match-time code evaluation isn't; if you have a regular expression using that in a Perl program, I'd bet you will not be able to copy it unmodified to a script written in another language and having it Just Work™.

        It's not gonna matter most of the time (famous last words, those), but one should still keep in mind that "Perl-compatible regular expressions" are really "almost-but-not-quite-Perl-compatible regular expressions".

        (The fact that different languages all using the pcre library are gonna have compatible regular expressions among each other is added irony.)

        recommended read Friedl's "Mastering Regular Expression" from O'Reilly.
        Yes, I'm reading the first edition of Friedl's famous book (as well as about ten other Perl / related books all at once).

        I specifially opened Mastering Regular Expressions looking for the answer to my above question, but couldn't find it.
Re: Why aren't perl regular expressions really regular expressions?
by choroba (Cardinal) on Mar 17, 2015 at 10:48 UTC
    Wikipedia gives some examples (e.g.
    (.*)\1
    ), too. The expressions are not "irregular", they are rather "non-regular".
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Why aren't perl regular expressions really regular expressions?
by Dumu (Monk) on Apr 12, 2015 at 01:32 UTC

    Thanks again to all Monks who took the time to answer and add to the discussion on this fascinating question, as well as provide links to further information.

    As it happens, I have just found another relevant resource, a link to which will, I hope, make a satisfying coda.

    Celebrated computer scientist Prof. Niklaus Wirth of the Eidgenössische Technische Hochschule Zürich (Federal Technical University of Zürich) recently updated his book 'Compiler Construction', which is available for download from his ETH homepage. For some reason, the 2014 PDF edition is divided into two parts, though of course it can easily be joined up again using PDFtk.

    In Chapter 3, Prof. Wirth states the following (after some previous discussion of Chomsky):

    "regular languages [play] a significant role in the realm of programming languages ... the following, simple definition may be given:

    A language is regular, if its syntax can be expressed by a single EBNF expression.

    The requirement that a single equation suffices also implies that only terminal symbols occur in the expression. Such an expression is called a regular expression."

    This quotation makes more sense in the context of the surrounding book, but I think it's basically saying the same as what Corion says above. EBNF stands, of course, for 'Extended Backus-Naur Form', which is expounded throughout the book - as it is Wirth's own extension of the older BNF.

    With the help and guidance of the above Monks and this volume from Prof. Wirth, I hope more readily to grasp the difference between Perl-legitimate regexen and the apparently much more specific meaning of the term in compiler design.