hill has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Gentle Monks,

Still a duffer so this makes me fret.

I was slurping in a couple of similar sized files with something close to:

for $file (@list) {
    open INPUT, "$file" or die;
    @partial = <INPUT>;
    push @complete, @partial;
}
Basically I'm concatenating the files. For some reason unknown to me, reading the second file took about three times as long as reading the first. Reversing the order of the files exhibited the same behavior -- whichever file was read second took significantly longer. Things got better, and a LOT faster, with:

for $file (@list) {
    open INPUT, "$file" or die;
    @complete = <INPUT>;
}
Comments? Explanations? General ideas?

In any case, many thanks for your attention.

Re: slow file slurping
by repellent (Priest) on May 01, 2009 at 00:57 UTC
    When you slurp files, you need memory to store those files. Performance may degrade depending on how much memory you use and how much work is performed.

    It makes sense that @complete = <INPUT> is faster because it keeps slurping into (and overwriting) the same array. Less memory is used than in the first example.

    When you push @complete, @partial, you're copying every "line" from @partial into @complete, thus using more memory and doing more work than the second example.

    Try this instead: push(@complete, $_) while <INPUT>;

    It pushes every line read from INPUT onto the array. This is better than push(@complete, <INPUT>), which slurps everything into an intermediate list first.

    We can get fancy with: my @complete = do { local (*ARGV, $/); @ARGV = @list; <> };

    Or better, take advantage of someone else's good work: File::Slurp
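    A minimal sketch with File::Slurp, assuming the module is installed and the same @list as above (read_file in list context returns the file's lines):

        use File::Slurp qw(read_file);

        my @complete;
        for my $file (@list) {
            push @complete, read_file($file);   # list context: one element per line
        }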

    Other comments:
    • Be strict: use warnings; use strict;
    • Use 3-argument open if you can. Die with the error $!.
    • Close the input filehandle when you're done. Use a lexical input filehandle if you can.
    • No need to double-quote "$file".

    So, something like:
    use warnings;
    use strict;

    my @list = ...
    my @complete;

    for my $file (@list) {
        open(INPUT, "<", $file) or die $!;
        push(@complete, $_) while <INPUT>;
        close(INPUT);
    }
Re: slow file slurping
by jwkrahn (Abbot) on May 01, 2009 at 01:12 UTC

    Try it like this and see if it helps:

    { local @ARGV = @list; @complete = <>; }
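    A fuller, self-contained sketch of the same idea (the file names here are placeholders):

        use strict;
        use warnings;

        my @list = ('first.txt', 'second.txt');   # placeholder names
        my @complete;
        {
            local @ARGV = @list;   # the diamond operator <> reads each file in @ARGV in turn
            @complete = <>;
        }
        print scalar(@complete), " lines read\n";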
Re: slow file slurping
by graff (Chancellor) on May 01, 2009 at 01:59 UTC
    Apart from the points made above (I really like jwkrahn's idea), the non-linear time difference (either file by itself reads a lot faster than the two files together) could indicate that the files are big enough (and/or you were wasting enough memory) that you ended up using virtual memory -- that is, portions of your in-memory array had to be paged out to the system swap file.

    The time drag when you're using swap will depend somewhat on what you do with your big array after it's loaded. If you're still seeing a serious runtime delay after trying the ideas above, your choices are: (a) put more RAM in the machine (or use a machine with more RAM), or (b) figure out how to do what needs to be done without holding all that data in memory at once (a database? DB_File? -- it depends on what you need to do).
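    If DB_File looks like a fit, a minimal sketch of its RECNO interface (file name is a placeholder) ties an array to a line-oriented file, so the lines stay on disk until you ask for them:

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # Each element of @lines maps to one line of the file, fetched on demand.
        tie my @lines, 'DB_File', 'big_concatenated.txt', O_RDONLY, 0644, $DB_RECNO
            or die "tie failed: $!";

        print "line 12345: $lines[12345]\n";
        print scalar(@lines), " lines total\n";
        untie @lines;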

Re: slow file slurping
by lostjimmy (Chaplain) on May 01, 2009 at 00:18 UTC

    I guess I would expect different behavior. Those two snippets aren't doing the same thing. The first is reading the file into one array, then appending that array onto another one. The second just reads into an array, overwriting the previous contents. In fact, in the second program, you aren't getting a concatenation of the two files at all.

    Update: what repellent said :)

Re: slow file slurping
by hill (Sexton) on May 01, 2009 at 11:22 UTC
    Thanks for your attention. I've returned with details that weren't available to me at home.

    First I need to claim a bit of brain fade. The line "@complete = <INPUT>;" in the second snippet should have read "push @complete, <INPUT>;"--lostjimmy was exactly right with his comment.

    Now back to the rat killing. When I run a pair of files (each about 89 MB) through these snippets, the first method takes about 105 seconds and the second only about four. If the files are concatenated outside the script (i.e. into a single 180 MB file), the first method runs in roughly 30 seconds while the second still takes around four.
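    In case anyone wants to reproduce this, a sketch along these lines should show the difference (file names are placeholders); swap the push line for the other variants to compare:

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        my @list = ('file_a.dat', 'file_b.dat');   # placeholder names

        my $t0 = [gettimeofday];
        my @complete;
        for my $file (@list) {
            open my $in, '<', $file or die "$file: $!";
            push @complete, $_ while <$in>;
            close $in;
        }
        printf "read %d lines in %.2f seconds\n", scalar(@complete), tv_interval($t0);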

    This is running on a 2.4 GHz Windows machine with 4 GB of RAM, and the behavior still puzzles me.

    As always, many thanks for your comments and thoughts.