Another little hint, though not directly to do with the regex itself:
A lot of the overhead in processing a large document comes from reading it line by line in a while loop.
You can cut that down by writing a routine that uses 'read' to pull in blocks of data at a time (set the block size to something meaningful for your document size; I use about 30k for a log reader I wrote).
This does mean keeping track of lines split across block boundaries and recombining them between reads, but that's happily resolved by using rindex to find the last newline character in the block and buffering whatever follows it for inclusion in the next read.
Once you get round this extra bit of coding, you can run your search as a multiline regex over each block, without a lot of the iteration overhead. Using this technique, along with pre-compiled regexes, a log reader here has been optimised from around a 5 min run time on a set of data down to 1 min 20 secs.
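For anyone who wants to try it, here is a minimal sketch of the block-reading idea, not the actual log reader. The sub name scan_in_blocks, its arguments, and the sample pattern are made up for illustration; it assumes the pattern is compiled with qr//m and never needs to match across more than one line.

```perl
use strict;
use warnings;

# Hypothetical helper: read $path in blocks of $block_size bytes and return
# everything $re captures. $re should be compiled with /m so ^ and $ anchor
# at each line inside the block.
sub scan_in_blocks {
    my ($path, $re, $block_size) = @_;
    $block_size ||= 30 * 1024;    # roughly the 30k block size mentioned above

    open my $fh, '<', $path or die "Cannot open $path: $!";

    my @hits;
    my $carry = '';               # partial line left over from the previous block
    while (read($fh, my $block, $block_size)) {
        $block = $carry . $block;

        # The last newline marks the end of the last complete line.
        my $last_nl = rindex($block, "\n");
        if ($last_nl < 0) {       # no complete line yet, keep accumulating
            $carry = $block;
            next;
        }

        # Hold back the incomplete tail for the next pass.
        $carry = substr($block, $last_nl + 1);
        my $complete = substr($block, 0, $last_nl + 1);

        # One regex pass over the whole block instead of one per line.
        push @hits, $complete =~ /$re/g;
    }

    # A final line without a trailing newline is still sitting in $carry.
    push @hits, $carry =~ /$re/g if length $carry;

    close $fh;
    return @hits;
}
```

Usage would look something like `my @codes = scan_in_blocks('app.log', qr/^ERROR\s+(\S+)/m);` where the file name and pattern are placeholders. Note that a pattern meant to span several lines could still be cut in two at a block boundary, so this sketch only suits per-line matches.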
Anyhow, this is just a little addendum to other comments here, and although indirect, it may help a little in the long run.

Cheers,

Malk

In reply to Re: Optimizing a regex (at a tangent) by Malkavian
in thread Optimizing a regex by ZydecoSue
