This is not a tutorial.

Perl's regex engine is not lightweight.

Every time you use $`, $& or $', the entire scalar you are searching is copied.

What's more, it is not only the scalars processed by the regex that references one of those variables that get copied: every scalar processed by every regex in your entire program also gets copied.
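For what it's worth, the usual way around this is to never mention those three variables at all. A minimal sketch, assuming Perl 5.10 or later for the /p flag; the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = 'key=value; other=thing';

# Option 1: the /p flag makes ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}
# available for this match only, without the program-wide penalty.
my ( $pre, $match, $post );
if ( $text =~ /value/p ) {
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
}

# Option 2: rebuild the same pieces from the match offsets in @- and @+.
my ( $pre2, $match2, $post2 );
if ( $text =~ /value/ ) {
    $pre2   = substr( $text, 0, $-[0] );
    $match2 = substr( $text, $-[0], $+[0] - $-[0] );
    $post2  = substr( $text, $+[0] );
}
```

Both options yield the same three pieces; the second works on older Perls too, at the cost of a little arithmetic.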

Further, every time you use capturing brackets, all the captured chunks are also copied--again.
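If the brackets are only there for grouping, a non-capturing group avoids saving anything. A small sketch with a made-up log line; nothing here is specific to any particular script:

```perl
use strict;
use warnings;

my $line = 'error: disk full';

# Capturing group: the alternation is saved in $1 even though it was
# only wanted for grouping.
my @captured = $line =~ /^(error|warning): (.*)$/;

# Non-capturing group (?:...): groups the alternation without saving it,
# so only the message is captured.
my @message_only = $line =~ /^(?:error|warning): (.*)$/;
```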

And even correctly written regexes that use two or more variable-length matches (<re>* or <re>+ etc.) can consume prodigious amounts of runtime stack and CPU.

Badly and/or naively written regexes that use nested quantifiers can have exponential runtimes. If the scalar they operate on is anything more than modestly sized, they can completely consume your process stack before finally trapping, having consumed all of your process memory allocation or system swap space, whichever runs out first.
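To make "nested quantifiers" concrete, here is a sketch of the classic shape and two ways of writing the same match without it. The patterns and the subject string are illustrative only, and the possessive form assumes Perl 5.10 or later:

```perl
use strict;
use warnings;

# A short subject that cannot match: a run of "a" with no trailing "b".
my $subject = 'a' x 20;

# Nested quantifiers: (a+)+ can split the run of "a"s in many ways, and a
# backtracking engine may try a great many of them before giving up.
my $nested = qr/^(a+)+b/;

# Same set of matching strings, with a single quantifier and nothing nested.
my $flat = qr/^a+b/;

# Possessive quantifier: once a++ has matched it never gives characters
# back, so there is nothing to backtrack into.
my $possessive = qr/^(?:a++)+b/;

my $flat_matches       = $subject =~ $flat       ? 1 : 0;
my $possessive_matches = $subject =~ $possessive ? 1 : 0;
```

All three patterns accept and reject the same strings; the difference is only in how much work a failing match may cost.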

Dooom, gloom, despondency.

More doom gloom and despondency.

Blah, blah, blah.

Oh, and here is a solution that prevents some of the problems by wrapping each call to the regex engine.

It starts another process and sends your scalars and the regex to it via sockets. That other process runs the regex on your behalf, and sends the results back via another socket. This neatly eliminates the $& problem, and allows recovery from the stack runaway/memory exhaustion problems whilst keeping your main process' memory requirements to a minimum.
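Purely as a sketch of the idea, and not the actual wrapper: the version below swaps the sockets for a fork and a pipe, and adds an alarm so a runaway match gets killed. The helper name match_in_child, its arguments, and the NUL-separated reply format are all assumptions made for illustration. It needs a platform with a working fork and alarm, and it would mangle captures that contain NUL bytes or are undefined.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $regex against $text in a child process and
# return the list the match produced (empty on no match or failure).
sub match_in_child {
    my ( $text, $regex, $timeout ) = @_;

    pipe( my $reader, my $writer ) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {    # child: do the match, report, leave
        close $reader;
        alarm $timeout;   # default SIGALRM action kills a runaway match
        my @captures = $text =~ $regex;
        print {$writer} join( "\0", 'ok', @captures ) if @captures;
        close $writer;    # flushes the reply to the parent
        POSIX::_exit(0);  # skip END blocks and inherited buffers
    }

    close $writer;        # parent: collect whatever the child sent
    local $/;             # slurp mode
    my $reply = <$reader>;
    close $reader;
    waitpid( $pid, 0 );

    return () unless defined $reply && length $reply;
    my ( $status, @captures ) = split /\0/, $reply, -1;
    return @captures;
}

my @parts = match_in_child( '2024-01-15', qr/^(\d+)-(\d+)-(\d+)$/, 5 );
```

If the child dies or times out, the parent simply sees an empty reply and carries on, which is the whole point of the exercise.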


This is not a serious attack on the Perl regex engine!

Whilst much of the above is and has been true for the past 5 (8?, 10?) years, most of it could not be otherwise.

And the point is that the regex engine isn't lightweight, and has some vagaries and caveats, but that hasn't prevented thousands of programmers from writing hundreds of thousands of perfectly functional, useful, beneficial scripts that use Perl's regex engine.

Note: The stack problem has been very cleverly fixed in a recent build.


In reply to Things you should need to know before using Perl regexes. (Humour, with a serious point) by BrowserUk
