in reply to Large file processing
That aside, this is a reasonable way to process your file. If it is still taking too long, the obvious next step is to look at process_paragraph and see whether it is efficient.
One thing that strikes me is that you're duplicating lots of strings (1 GB x 2 at least). It may be more efficient to pass around references (i.e. \$_), as in the sketch below.
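A minimal sketch of what I mean; the file name and the body of process_paragraph here are stand-ins for your own:

```perl
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'big_file.txt' or die "open: $!";   # illustrative file name
local $/ = "";                          # paragraph mode

while (my $para = <$fh>) {
    process_paragraph(\$para);          # pass a reference: no copy of the string
}
close $fh;

sub process_paragraph {
    my ($ref) = @_;                     # a scalar reference
    my $words = () = $$ref =~ /\S+/g;   # work on $$ref directly, e.g. count words
}
```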
Also, if you are on a multi-core/CPU box you might want to split the file into a number of pieces and process it that way. A cheap and easy way would be to have one instance of your script process the odd paragraphs and another the even. Better would be to use seek to skip to halfway (or to offsets appropriate to your number of instances) and start at the next paragraph; see the sketch below. With the seek approach you will need to be careful not to process the same paragraph more than once (or to skip the boundary paragraphs).
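A hypothetical sketch of the seek approach: you would run one copy per core, e.g. `perl chunk.pl 0 4 big_file.txt`, `perl chunk.pl 1 4 big_file.txt`, and so on. The arguments, file name and process_paragraph are all illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($instance, $instances, $file) = @ARGV;

open my $fh, '<', $file or die "open $file: $!";
my $size  = -s $fh;
my $start = int($size *  $instance      / $instances);
my $end   = int($size * ($instance + 1) / $instances);

local $/ = "";                          # paragraph mode

if ($start > 0) {
    seek $fh, $start, 0 or die "seek: $!";
    scalar <$fh>;     # discard the paragraph we landed inside; the previous
}                     # instance owns it. If $start falls exactly on a
                      # paragraph boundary this can skip or double-process
                      # a paragraph, so test the boundary cases as noted above.

while (tell($fh) < $end) {              # stop once our slice is exhausted
    my $para = <$fh>;
    last unless defined $para;
    process_paragraph(\$para);
}
close $fh;

sub process_paragraph { my ($ref) = @_; }   # stand-in for the real work
```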
The key with all optimisations is to make sure you are actually speeding things up, by benchmarking your initial solution and re-benchmarking your proposed changes. See the Benchmark module.
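For example, a toy comparison of copying a large string versus passing a reference (the 10 MB size and sub names are illustrative):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $big = 'x' x 10_000_000;

sub by_copy { my ($s)   = @_; length $s   }   # copies the string on assignment
sub by_ref  { my ($ref) = @_; length $$ref }  # no copy made

cmpthese(-3, {                      # run each variant for ~3 CPU seconds
    copy => sub { by_copy($big)  },
    ref  => sub { by_ref(\$big)  },
});
```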