Viking has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a log parser that will be parsing log files that are many megabytes in size. When I open a file, is the file read into memory or is it read from disk?
eg:
open FILE, "somefile" or die $!;
while (<FILE>) {
    # do stuff
}
close FILE;

Replies are listed 'Best First'.
Re (tilly) 1: open, file handles and memory
by tilly (Archbishop) on Jan 13, 2001 at 20:03 UTC
    On every operating system, that pattern will work unless you are accumulating memory in your loop. Just be careful: operations that impose a list context on the filehandle will slurp the whole thing into an array. So if memory is an issue, avoid the following kinds of things:
    foreach my $line (<FILE>) {
        # etc
    }
    my @lines = sort <FILE>;
    print <FILE>;
    Also, it is a picky detail, but instead of just dying I find it very helpful to have the die message contain full context information, as perlstyle recommends.
    open(FILE, "< $file") or die "Cannot read $file: $!";
    (Or use Carp and confess() rather than die.)
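    For illustration, here is a minimal sketch (not from the original post) of the same check written with Carp's confess(), which dies with a full stack trace; the filename is hypothetical:

    use Carp;

    my $file = "somefile.log";    # hypothetical log file name
    open(FILE, "< $file") or confess "Cannot read $file: $!";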

    A probably useless tip. Occasionally you run across a situation where you want to process large files (eg 40 GB each) and Perl does not have large file support compiled in. In that case do your reads like this:

    open(FILE, "cat $file |") or die "Cannot read $file: $!";
    As long as cat understands large files this works smoothly, since Perl has no problem reading from an endless pipe.
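    A minimal sketch (not from the original post) of combining that pipe open with the usual line-by-line loop; the filename is hypothetical:

    my $file = "huge.log";    # hypothetical very large log file
    open(FILE, "cat $file |") or die "Cannot read $file: $!";
    while (<FILE>) {
        # do stuff with $_, one line at a time, exactly as before
    }
    close(FILE);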
Re: open, file handles and memory
by mwp (Hermit) on Jan 13, 2001 at 17:42 UTC
    As far as I know, that really depends on the operating system and how it handles open file handles. However, most of the time it will buffer the file--read in one line at a time. I wouldn't worry about it.

    Just don't do something like this:

    open(FILE, "somefile") or die $!;
    my @log = <FILE>; # read entire file into RAM
    close(FILE);
    foreach my $line (@log) {
        # do stuff to $line
    }
    That will definitely clutter up your machine's memory!

    On the other hand, if you have a gigabyte of RAM and you WANT to load the entire file instead of using slow disk accesses, knock yourself out. =)

      That's what I want to avoid. Speed isn't an issue (well, within reason), but having my server grind to a halt because it ran out of memory would be!! :)
Re: open, file handles and memory
by repson (Chaplain) on Jan 13, 2001 at 18:04 UTC
    while (<FILE>) is fine; just make sure it's not for (<FILE>), which _would_ load the whole file into memory but otherwise seems mostly identical to while.

    It's unlikely to be significant for parsing logfiles, but sometimes it's not optimal to go line by line, since that increases the number of I/O operations, which are generally slow. On the other hand, slurping the WHOLE file into memory requires lots of RAM. You can strike a compromise by doing reads of maybe half a megabyte (I don't know what's optimal, but that seems sensible enough to me :). That way you process the file in larger chunks with less frequent disk access, so there is a good chance of a faster runtime. However, it will make the code marginally more complicated to write; a sketch of the idea follows below.
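    A minimal sketch of that chunked approach (not from the original post; the filename and buffer size are arbitrary assumptions), reading roughly half a megabyte at a time and peeling complete lines out of the buffer:

    my $file       = "somefile.log";    # hypothetical log file
    my $chunk_size = 512 * 1024;        # roughly half a megabyte per read

    open(FILE, "< $file") or die "Cannot read $file: $!";
    my $buffer = '';
    while (read(FILE, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # peel off complete lines; keep any partial line for the next read
        while ($buffer =~ s/^(.*\n)//) {
            my $line = $1;
            # do stuff with $line
        }
    }
    # do stuff with whatever is left in $buffer here,
    # in case the file did not end in a newline
    close(FILE);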

      When you instruct the Perl interpreter to read the next line of the file, it can't know in advance how long the line will be, so it's logical that Perl itself reads larger chunks at a time and buffers the input, just like the OS does (YMMV).
        Performing an 'strace' on this code:
        open(F, "<README"); # arbitrary
        $|=1;
        while(<F>) {
            print "line $.\n";
        }
        Results in this:
        open("README", O_RDONLY|O_LARGEFILE)    = 3
        read(3, "\t\t GNU GENERAL PUBLIC LICENSE"..., 4096) = 4096  # first block
        write(1, "line 1\n", 7line 1
        write(1, "line 2\n", 7line 2
        ...
        write(1, "line 80\n", 8line 80
        read(3, " and appropriately publish on ea"..., 4096) = 4096 # second block
        write(1, "line 81\n", 8line 81
        So it does appear to buffer the data in chunks, but they seem to be manageably sized. This too may differ depending upon OS or build of Perl.