This is not a tutorial.

Perl's regex engine is not lightweight.

Every time you use $`, $& or $', the entire scalar you are searching is copied.

What's more, it is not only the scalars processed by the regex that refers to one of those variables that get copied; every scalar processed by every regex in your entire program also gets copied.
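As a rough illustration of how the three variables can be avoided altogether, the sketch below rebuilds their values from the offset arrays @- and @+ with substr. The helper name match_parts is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the match, the match itself,
# and the text after it, without ever mentioning $`, $& or $'.
sub match_parts {
    my ($str, $re) = @_;
    return unless $str =~ $re;

    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ($start, $end) = ($-[0], $+[0]);
    return (
        substr($str, 0, $start),               # what $` would have held
        substr($str, $start, $end - $start),   # what $& would have held
        substr($str, $end),                    # what $' would have held
    );
}

my ($pre, $match, $post) = match_parts('hello world', qr/o w/);
print "[$pre][$match][$post]\n";   # prints [hell][o w][orld]
```

Because the three variables never appear anywhere in the program, the engine has no reason to make the extra copies described above.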

Further, every time you use capturing brackets, all the captured chunks are also copied--again.

And, even correctly written regexes that use two or more variable length matches (<re>* or <re>+ etc.) can consume prodigious amounts of runtime stack and CPU.

Badly and/or naively written regexes that use nested quantifiers can have exponential runtimes, and if the scalar they operate on is anything more than modestly sized, they can completely consume your process stack before finally trapping, having consumed all your process memory allocation or system swap space--whichever runs out first.
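A minimal sketch of what "nested quantifiers" means, and two common ways of writing the same match without them. The patterns and inputs are made up for illustration, and the inputs are kept tiny on purpose; how badly the nested form behaves on long strings depends on the perl version and its optimisations, so no timings are claimed here.

```perl
use strict;
use warnings;

# Nested quantifier: when the match fails, the engine may try every way of
# dividing the run of 'a's between the inner '+' and the outer '+'.
my $nested = qr/^(?:a+)+$/;

# The same language with a single quantifier: nothing to re-divide.
my $flat = qr/^a+$/;

# Atomic group: once (?>a+) has matched, it never gives characters back,
# so the alternative divisions are not explored.
my $atomic = qr/^(?>a+)+$/;

for my $input ('aaaa', 'aaab', '') {
    my @results = map { $input =~ $_ ? 1 : 0 } $nested, $flat, $atomic;
    print "'$input': @results\n";
}
```

All three patterns accept and reject exactly the same strings; only the amount of backtracking available to the engine on a failed match differs.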

Dooom, gloom, despondency.

More doom gloom and despondency.

Blah, blah, blah.

Oh, and here is a solution that prevents some of the problems by wrapping each call to the regex engine.

It starts another process and sends your scalars and the regex to it via sockets. That other process runs the regex on your behalf, and sends the results back via another socket. This neatly eliminates the $& problem, and allows recovery from the stack runaway/memory exhaustion problems whilst keeping your main process' memory requirements to a minimum.


This is not a serious attack on the perl regex engine!

Whilst much of the above is and has been true for the past 5 (8?, 10?) years, most of it could not be otherwise.

And the point is that the regex engine isn't lightweight, and has some vagaries and caveats,

but that hasn't prevented thousands of programmers from writing hundreds of thousands of perfectly functional, useful, beneficial scripts that use Perl's regex engine.

Note: The stack problem has been very cleverly fixed in a recent build.

Re: Things you should need to know before using Perl regexes. (Humour, with a serious point)
by Corion (Patriarch) on Oct 25, 2006 at 09:18 UTC
    Further, every time you use capturing brackets, all the captured chunks are also copied--again.

    Not exactly ;)

    Q:\>perl -le "($x = 'foo') =~ /.(.)/g; print $1; $x = 'bar'; print $1"
    o
    a

    This is perl, v5.8.2 built for MSWin32-x86-multi-thread

    Of course, the /g modifier there is a bug in the code, but it still shows that a capturing match does not copy the buffer in every case.

    Thanks to dave_the_m and demerphq's recent work, the Perl 5.10 regex engine is improving even more, beyond the C recursion elimination. It has named captures that bring it up to par with, and beyond, what other named-capture implementations provide, and as far as I understand the changes, it is considerably faster against Unicode strings. There are some deeper problems with how closures in regular expressions (in (?{..}) blocks) are handled vs. interpolation.

    This post is mostly about adding some perspective to the changes that happen to the regex engine ;)

      Q:\>perl -le "($x = 'foo') =~ /.(.)/g; print $1; $x = 'bar'; print $1"
      o
      a

      Wow. Didn't expect that.

      betterworld and I just tried out some other examples, and it seems that the string buffer is not really emptied.

      $ perl -MData::Dumper -we'$Data::Dumper::Useqq = 1; ($x = "fou") =~ /.(..)/g; print Dumper $1; $x = "b"; print Dumper $1; '
      $VAR1 = "ou";
      $VAR1 = "\0u";
      $ perl -MData::Dumper -we'$Data::Dumper::Useqq = 1; ($x = "fou") =~ /.(..)/g; print Dumper $1; $x = "ba"; print Dumper $1; '
      $VAR1 = "ou";
      $VAR1 = "a\0";

      So $1 just outputs the second and third characters of the string, and in the first example you see the remains of the string 'fou'.
Re: Things you should need to know before using Perl regexes. (Humour, with a serious point)
by liz (Monsignor) on Oct 25, 2006 at 10:06 UTC
    ;-)

    I guess Persiflage is the sincerest form of flattery.

    Liz

      Only be sure always to call it please, "research".
Re: Things you should need to know before using Perl regexes. (Humour, with a serious point)
by bennymack (Pilgrim) on Oct 25, 2006 at 12:15 UTC

    I'm curious what you were actually trying to link to in:

    Oh. and here is a solution that prevents some of the problems by wrapping each call to the regex engine.

    Because it just does a search for the word "here", AFAICT.

      It's a placeholder for a link to the module described for when I get around to writing it.

      (You did notice the word "Humour" in the title didn't you :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Things you should need to know before using Perl regexes. (Humour, with a serious point)
by Anonymous Monk on Oct 25, 2006 at 19:41 UTC

    Your title to the contrary, I find it hard to see much humor in this post. I read it as more of your relentless bitterness at people who talk down Perl's threads. I think I understand your view -- that threads are good (the future, as you put it) and Perl's threads won't get better unless people use them. But ignoring the reality of Perl's threads is not going to help, either.

    Also, lest you accuse me of ignoring the superficial "point" of your post, the regex engine is indeed a monster. The amount of work being put into it in bleadperl should be a good indicator of that. I am not going to argue with you on this point, but that is unrelated to the threads issues.

      the regex engine is indeed a monster. The amount of work being put into it in bleadperl should be a good indicator of that.

      I agree with the first point. I'm not so sure about the second point. It's hard to calculate how much of my dev time has been related to the complexity and opacity of the engine and how much has been other things. I just don't see a necessary relationship between monstrosity and the time spent on development. I wonder what dave_the_m would say.

      I kinda wonder at how any industrial strength regular expression engine could be anything but a monster. A program's structure comes to reflect its problem domain, I think, and when the domain is complex, the code will be too. And I think that the problem domain of search and replace with Perl-style not-so-regular regular expressions is quite complex. I've looked at the sources for the latest Tcl engine and for PCRE, and they are all large, comprehensive bodies of code. I'll grant that Perl's is not the cleanest implementation, nor the best documented, but IMO all of those packages are monsters.

      I guess it all comes down to what you consider monster code to be.

      ---
      $world=~s/war/peace/g

      Your title to the contrary, I find it hard to see much humor in this post. I read it as more of your relentless bitterness at people who talk down Perl's threads.

      I'll trade you s/Humour/Parody/, if you'll trade me s/bitterness/frustration/?

      ... the superficial "point" of your post, the regex engine is indeed a monster.

      The (more than superficial) point of the post is that despite all the regex engine's flaws, they haven't stopped it from being one of the great strengths of Perl 5. Nor have they prevented a huge number of very useful scripts being written that utilise it and run day in, day out in production environments all over the world.

      And through its continual use, and the bug reports and feedback that this use generates for the guys that maintain it, it has gone from strength to strength to strength. And continues to do so. No one will be pointing new questions about the regex engine to the OP, despite the truths it holds, because they know it has flaws, but they also know that the benefits of using it, with a modicum of care, far outweigh the risks of the flaws.

      Equally, no one, least of all me, is denying the flaws in iThreads. Do a supersearch against my handle for threads clone copy fork (you may need to specify an alternate delimiter to get OR functionality), and see how many posts I have devoted to noting and reiterating the problems with the iThreads architecture.

      Despite that, I have continually promoted the idea that for certain kinds of problems, with a modicum of care, using iThreads results in simpler, cleaner, more maintainable and reliable solutions than the alternatives.

      If no one used the regex engine, there would have been no incentive for its improvement. If no one uses threads, there will be no incentive for them to improve.

      But that is still not the biggest point I am trying to make.

      Many of the limitations of iThreads are so fundamental, that it is doubtful whether they can ever be fixed. These are not bugs, but design and implementation limitations that would require huge changes to the core of perl 5 to eliminate. They come about through a combination of three main factors:

      1. As I pointed out elsewhere, retro-fitting threading to an existing, complex, mature product that was never intended to be used in a threaded environment is not just extremely technically challenging. It is damn near impossible without a ground-up rewrite.

        This is why I have described the work, and the people who achieved it, to give us iThreads as "heroic". I do not, ever, use that term lightly or sarcastically.

      2. The API chosen upon which to base Perl threading is the severely limited, strict POSIX pthreads description.

        This API is minimal, weak and flawed. If you doubt my opinion on this, look around at all the *nix platforms that have extended it, often in mutually incompatible ways.

      3. The emulation of the fork mechanism.

        Without COW, this is hugely expensive of memory. Even with COW, it is hugely costly in time.

        In a program that was never written to be used in a threaded environment, with runtimes that originate from long before threading was ever a consideration (i.e. before reentrancy was considered a virtue, so they are littered with non-reentrant APIs like strtok and hard-coded limits like FILE* structs, and have adapted to reentrancy through the path of least resistance), there is simply too much global and static data scattered in isolated pockets at all levels for this to be efficient.

      Unless these limitations are explored and exposed--which requires that people use them--there will be nothing to stop the next generation of Perl, P6, from making exactly the same decisions and exactly the same mistakes.

      And that, I strongly believe, is a point worth making. Even at the expense of stepping on a few people's toes.

      liz, the author of the post I parodied, was the second person to respond to the OP and she did so in a far better way than I could ever have hoped for. She saw both the point and the funny side. Despite my implicit and explicit criticisms levelled at her with regard to the effect of her post upon the fortunes and reputation of iThreads, she has gone on to more than make up for it by contributing her time to the development of Perl 6 threading.

      In summary.

      • Yes. Threading has a place in any modern language, because it can solve some problems more simply and more efficiently than any other solution.
      • Yes. In the next few Moore's cycles there will be more and more opportunities for threading to be used beneficially by application programmers, to reduce complexity as well as improve performance.

      • No. It is not a total replacement for fork, nor events, nor state machines, nor clusters. It complements them all. It adds another tool to the programmer's toolkit, one that solves some problems that the other forms of parallelisation either cannot solve or cannot solve as easily.

        It also provides for an imperfect solution to traditionally forked problems on those platforms that do not have fork. Like my own preferred platform. I wish win32 had a proper fork. There is absolutely no technical reason why it could not. That cygwin can do it, albeit rather slowly and laboriously, is one good indication of this. That my own attempts have come close to achieving it is for me another.

        But politics is politics, and I have no expectation that MS are about to have a change of heart.

      • Yes. iThreads are flawed.
      • No. iThreads are not unusable. Even in production environments, given the appropriate knowledge and care.
      • Yes. Locking can be a pain.

        Locking is no more painful or difficult than dealing with file locks or record locks or IPC semaphores.

        And, used from Perl5, with its explicit protection of non-shared data from accidental concurrency; its isolation of its own internal structures from the application programmer and the removal of the need for them to concern themselves with the locking of those structures; and its provision of lexically scoped locking primitives; it's a lot easier in Perl than in many other languages.

      • Yes. iThreads can improve, but people need to discover the limitations and bugs before those improvements can be made.
      • Yes. All experience gathered from the development and use of iThreads can be usefully harnessed to ensure that a better design and implementation is used in the underpinnings of Perl 6.
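To make the remark above about lexically scoped locks and the protection of non-shared data a little more concrete, here is a minimal sketch. It assumes a perl built with ithreads support; the variable and sub names are invented for the example.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter :shared = 0;   # explicitly shared between threads
my $private = 0;           # not shared: each thread works on its own copy

sub worker {
    for (1 .. 1000) {
        lock($counter);    # lexically scoped: released at the end of this block
        $counter++;
    }
    $private++;            # touches only this thread's copy
}

my @workers = map { threads->create(\&worker) } 1 .. 4;
$_->join for @workers;

print "counter=$counter private=$private\n";   # prints counter=4000 private=0
```

There is no explicit unlock call; the lock goes away when the enclosing block exits, and the unshared variable in the main thread is never affected by what the workers do to their copies.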
