in reply to Optimizing a regex
Another little hint, though not directly to do with the regex itself:
A lot of the overhead with a large document comes from reading it line by line in a while loop.
You can cut that down by writing a routine that uses 'read' to pull in a block of data at a time (set the size to something meaningful for your doc size; I use about 30k for a log reader I wrote).
This does mean keeping track of lines that get split across block boundaries and recombining them between reads, but that's happily resolved by using rindex to find the last newline character in the block and buffering whatever follows it for inclusion in the next read.
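For anyone who wants to see the shape of it, here is a rough sketch of the block reading with the rindex trick. It is not the code from my log reader; the sub name read_in_blocks and the callback interface are made up for illustration, and the block size is whatever you pass in.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions, not from any module).
# Calls $handler->($chunk) with chunks that end on a line boundary,
# except possibly the very last chunk if the file lacks a trailing newline.
sub read_in_blocks {
    my ($fh, $block_size, $handler) = @_;
    my $carry = '';    # partial line left over from the previous block

    while (1) {
        my $got = read($fh, my $block, $block_size);
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # end of file

        $block = $carry . $block;
        my $last_nl = rindex($block, "\n");

        if ($last_nl < 0) {
            # No newline in this block yet: keep accumulating.
            $carry = $block;
            next;
        }

        # Everything after the last newline is an incomplete line;
        # hold it back and prepend it to the next read.
        $carry = substr($block, $last_nl + 1);
        $handler->(substr($block, 0, $last_nl + 1));
    }

    # Flush a final line that had no trailing newline.
    $handler->($carry) if length $carry;
    return;
}
```

You would call it with an open filehandle, a size and a code ref, something like read_in_blocks($fh, 30_000, sub { ... }), and do your matching inside the code ref.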
However, once you get round this extra bit of coding, you end up being able to do your search in a multiline regex, without a lot of the iteration overhead. Using this technique, along with pre-compiled regexes, a log reader here has been optimised from around a 5 min run time on a set of data down to 1 min 20 secs.
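As a rough illustration of the multiline search over a chunk of complete lines, something along these lines works. The log format ("timestamp ERROR message") and the names $error_line and scan_chunk are invented for the example, so swap in your own pattern.

```perl
use strict;
use warnings;

# Compiled once with qr//. The /m flag makes ^ and $ match at the start
# and end of every line inside the chunk, not just the whole string.
# The log format here is an assumption for illustration only.
my $error_line = qr/^(\S+) ERROR (.*)$/m;

# Hypothetical helper: returns [timestamp, message] pairs for every
# matching line in a chunk holding many lines.
sub scan_chunk {
    my ($chunk) = @_;
    my @hits;
    while ($chunk =~ /$error_line/g) {
        push @hits, [$1, $2];
    }
    return @hits;
}
```

The point is that the regex engine walks the whole chunk in one go, so the per-line loop in Perl code disappears and you only iterate over actual matches.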
Anyhow, this is just a little addendum to other comments here, and although indirect, it may help a little in the long run.
Cheers,
Malk