coldy has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am manipulating a large file (~1 GB) and my program seems to be very inefficient. At the moment my plan of attack is to read it paragraph by paragraph, analyze each paragraph, and write the result to a new file. This is convenient as the paragraphs within the file need to be analyzed separately. I hope this is not too vague a question, but is there a more efficient way of doing this, apart from the way I'm doing it below?
$/ = ""; open(F, "<$infile") or die "Could not open $infile for a reading!\n" +; while ( <F>) { my $line = process_paragraph($_); print OUT2, "$line\n"; }
Hopefully somebody can point me in the right direction. Cheers

Re: Large file processing
by pc88mxer (Vicar) on Jul 14, 2008 at 04:51 UTC
    Your approach seems pretty reasonable. Just one pointer:
    {
        local($/) = "...end of paragraph delimiter...";
        while (<F>) {
            my $line = process_paragraph($_);
            print OUT2 $line;
        }
    }
    This localizes the value of $/ so that it gets reset upon leaving the block. It's a defensive measure so that you don't get surprised later on when you expect $/ to contain the default value.

    Also, note that in the while loop, $_ will contain the end of paragraph delimiter. You can chomp it or just leave it, depending on what your processing routine does.
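
    A minimal sketch of that, using a purely hypothetical "--END--" delimiter just for illustration:

        local $/ = "--END--";          # custom record separator
        while (<F>) {
            chomp;                     # strips the trailing "--END--", if present
            my $line = process_paragraph($_);
            print OUT2 $line;
        }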

Re: Large file processing
by GrandFather (Saint) on Jul 14, 2008 at 04:56 UTC

    Where is your code spending its time? If it's in process_paragraph we need to know the details of process_paragraph. If it's in file I/O there's probably not much you can do. You have already avoided the first trap - slurping a large file into memory. The OS is probably already buffering the file I/O for you so there is unlikely to be much advantage gained by doing your own buffering.
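
    If you want a quick-and-dirty way to see which side dominates, you could time the two parts separately with the core Time::HiRes module. A rough sketch, reusing the F, OUT2 and process_paragraph names from the original post:

        use Time::HiRes qw(time);

        my ($read_time, $proc_time) = (0, 0);
        my $t0 = time;
        while (<F>) {
            my $t1 = time;
            $read_time += $t1 - $t0;           # time spent reading the next paragraph
            my $line = process_paragraph($_);
            my $t2 = time;
            $proc_time += $t2 - $t1;           # time spent in your own processing
            print OUT2 $line;
            $t0 = time;
        }
        printf "reading: %.1fs  processing: %.1fs\n", $read_time, $proc_time;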

    If your paragraphs are fixed length you may gain a little by using read instead of <>, but that gain is likely to be pretty small.
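
    A rough sketch of what that could look like, assuming a purely hypothetical record length of 4096 bytes:

        my $reclen = 4096;                     # hypothetical fixed paragraph size
        while (read(F, my $para, $reclen)) {
            my $line = process_paragraph($para);
            print OUT2 $line;
        }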


    Perl is environmentally friendly - it saves trees
Re: Large file processing
by aufflick (Deacon) on Jul 14, 2008 at 09:20 UTC
    Firstly I'd recommend local as above, and also comment that setting it to "" results in splitting on empty lines as a maintainer (or even you) may not immediately see that in a year or so.

    That aside, this is a reasonable way to process your file. If it is still taking too long, obviously you can look at process_paragraph and see whether it is efficient.

    One thing that strikes me is that you're duplicating lots of strings (1 GB x 2 at least). It may be more efficient to pass around references (i.e. \$_).
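
    As a sketch of the idea (process_paragraph would have to be rewritten to expect a reference, and the substitution inside it is just a stand-in for your real analysis):

        while (<F>) {
            process_paragraph(\$_);     # pass a reference instead of copying each paragraph
            print OUT2 "$_\n";
        }

        sub process_paragraph {
            my $ref = shift;
            $$ref =~ s/foo/bar/g;       # edit the paragraph in place through the reference
        }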

    Also, if you are on a multi-core/CPU box you might want to split the file into a number of pieces and process them in parallel. A cheat-easy way would be to have one instance of your script process the odd paragraphs and another process the even ones. Better would be to use seek to skip to halfway (or an amount appropriate to your number of instances) and start at the next paragraph. Using the seek method you will need to be careful not to process the same paragraph more than once (or to skip the boundary paragraphs).
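
    A very rough sketch of the seek approach for two instances. This is only the "second half" worker; the first would read from the start and stop once tell() passes the halfway point, and the boundary paragraph still needs the care mentioned above:

        open my $fh, '<', $infile or die "Cannot open $infile: $!\n";
        local $/ = "";                          # paragraph mode
        my $half = int( (-s $infile) / 2 );
        seek $fh, $half, 0;                     # jump to (roughly) the middle of the file
        <$fh>;                                  # discard the partial paragraph we landed in
        while (<$fh>) {
            print OUT2 process_paragraph($_);
        }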

    The key with all optimisations is to make sure you are actually speeding things up by benchmarking your initial solution and re-benchmarking your proposed changes. See the Benchmark module.
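
    For example, with the core Benchmark module (the sample paragraph and the process_paragraph_v2 name are just placeholders for your data and your proposed change):

        use Benchmark qw(cmpthese);

        my $sample = "one representative paragraph of your data ...\n\n";

        cmpthese( -10, {                        # run each variant for ~10 CPU seconds
            original => sub { process_paragraph($sample)    },
            proposed => sub { process_paragraph_v2($sample) },
        } );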