aboyd has asked for the wisdom of the Perl Monks concerning the following question:

I have 3 big files -- about 1 GB each. Each one is a Web server log file from a given year. I want to break them out monthly. Which do you suggest:

1. while loop through the log files, regex matching on month/year, opening a month/year filehandle if it's not already open, and writing the line out. By the end of this process, Perl will have 36 filehandles open for writing and 1 open for reading. Will the size of the files matter, or is having 36 files of 100 MB each open the same as having 36 files of 10 KB each open? (See the sketch after this list.)

2. loop through the month/year combinations, opening only the files needed per month. At any given point, there would be only 1 file open for reading and 1 open for writing. However, that would mean 36 full passes through the data.

3. Some other option? My amateurish experiments so far show that the process will take hours. I'm looking to remove my regex matching and replace it with index -- I suspect that's faster. Is there anything else you might offer as advice?
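
For concreteness, here's roughly what I have in mind for option 1 (an untested sketch; the "access-2004.log" name and the "2004-MM" date token are just illustrative), with index/substr standing in for the regex as in option 3:

    use strict;
    use warnings;

    # Hypothetical input name and date format -- adjust to the real logs.
    open(my $log_fh, '<', 'access-2004.log')
        or die "Unable to open access-2004.log: $!\n";

    my %fh_for;    # "2004-MM" => open output handle (at most 12 per yearly file)
    while (my $line = <$log_fh>) {
        my $pos = index($line, '2004-');     # index() instead of a regex match
        next if $pos < 0;                    # no date token on this line
        my $key = substr($line, $pos, 7);    # e.g. "2004-07"
        unless ($fh_for{$key}) {
            open($fh_for{$key}, '>', "access-$key.log")
                or die "Unable to create access-$key.log: $!\n";
        }
        print { $fh_for{$key} } $line;
    }
    close $_ for values %fh_for;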

-Tony

Re: Opine on multiple open filehandles vs. multiple loops through data
by ikegami (Patriarch) on Jan 20, 2006 at 21:02 UTC

    Since the log files are sorted, there's no reason to have more than two handles open.

    while (my $logfile = glob('access-????.log')) {
        open(my $logfile_fh, '<', $logfile)
            or die("Unable to open yearly logfile $logfile: $!\n");

        my ($year) = $logfile =~ /^access-(....)\.log$/;

        my $output_fh;
        my $month = -1;
        while (<$logfile_fh>) {
            # Extract month of log entry.
            my ($new_month) = /$year-(..)/;

            if ($new_month != $month) {
                $month = $new_month;
                my $output_file = sprintf('access-%04d-%02d.log', $year, $month);
                open($output_fh, '>', $output_file)
                    or die("Unable to create monthly logfile $output_file: $!\n");
            }

            print $output_fh $_;
        }
    }
      Sadly, the files aren't perfectly sorted. At the month changeover, there are dozens of lines out of order. I assume that the server wrote the lines in batches, and didn't bother with FIFO.

      However, the bulk of each month is ordered properly. I think trying it your way would be fine -- even if it has to close/open/reopen files 20 times during the change of each month, that's still probably A-OK. Overall, that's just maybe 750 open/closes. Oh. Maybe that is a bit much. Hmm.

      Well, it's something to test out, to see how speedy it is. Thank you. :)
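
      One tweak that might make the close/reopen approach safe against those stragglers (a sketch along the same lines, with illustrative filenames): open the monthly files for append instead of write, so a reopen never truncates what's already been written, and clear out any old monthly files up front.

          use strict;
          use warnings;

          unlink glob('access-2004-??.log');   # start clean, since we append below

          open(my $log_fh, '<', 'access-2004.log')
              or die "Unable to open access-2004.log: $!\n";

          my $month = '';
          my $out_fh;
          while (my $line = <$log_fh>) {
              my ($new_month) = $line =~ /2004-(\d\d)/ or next;
              if ($new_month ne $month) {
                  $month = $new_month;
                  open($out_fh, '>>', "access-2004-$month.log")   # '>>', not '>'
                      or die "Unable to open access-2004-$month.log: $!\n";
              }
              print $out_fh $line;
          }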

Re: Opine on multiple open filehandles vs. multiple loops through data
by Fletch (Bishop) on Jan 20, 2006 at 20:59 UTC

    For any halfway decent modern OS (or even Windows, for that matter . . .), 36 open files should be no problem. The size of the files doesn't enter into the equation, since a filehandle is, underneath, (more or less) just a few pointers; the storage requirements for one are minuscule. Granted, if you were talking about several more orders of magnitude (thousands of handles), you might run into problems and need to look at something like FileCache, but for this many it shouldn't be a problem.

    Update: Oops, I didn't even think about it (stayed up too late running SM 4-man last night . . . :), but as is pointed out below, each file should already be sorted, so you can just read until the date changes and then open the new output file. Now, if you had multiple files you wanted to merge (say, from n separate webservers) into one consolidated log file, you would need multiple files open for reading, but again just one output file would serve.
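
    If you ever did need more handles than the system allows, FileCache usage looks roughly like this (a sketch; the filenames and the "YYYY-MM" extraction are illustrative, not from the original logs):

        use strict;
        use warnings;
        no strict 'refs';              # FileCache uses the path string as the handle
        use FileCache maxopen => 16;   # transparently close/reopen past 16 handles

        open(my $log_fh, '<', 'access-2004.log')
            or die "Unable to open access-2004.log: $!\n";

        while (my $line = <$log_fh>) {
            my ($key) = $line =~ /(\d{4}-\d{2})/ or next;
            my $path  = "access-$key.log";
            cacheout $path;    # '>' on first open, '>>' on any reopen; croaks on failure
            print $path $line;
        }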

Re: Opine on multiple open filehandles vs. multiple loops through data
by Limbic~Region (Chancellor) on Jan 21, 2006 at 15:35 UTC
    aboyd,
    Since your concern seems to be with speed, I would recommend taking a look at Performance Trap - Opening/Closing Files Inside a Loop. In a nutshell, the most cost-effective approach is to avoid opening and closing filehandles inside the loop and, potentially, to buffer your writes as well, so that you reduce your biggest performance problem: IO.
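
    In that spirit, a rough sketch (illustrative names; the "YYYY-MM" extraction is assumed): keep every monthly handle open for the whole run, and accumulate lines in memory so each file sees a few large writes instead of millions of tiny ones.

        use strict;
        use warnings;

        my %fh_for;                       # month => handle, opened once and kept open
        my %buffer;                       # month => pending lines
        my $pending   = 0;                # bytes buffered so far
        my $max_bytes = 8 * 1024 * 1024;  # flush threshold; tune to taste

        sub flush_buffers {
            for my $key (keys %buffer) {
                unless ($fh_for{$key}) {
                    open($fh_for{$key}, '>', "access-$key.log")
                        or die "Unable to create access-$key.log: $!\n";
                }
                print { $fh_for{$key} } $buffer{$key};
            }
            %buffer  = ();
            $pending = 0;
        }

        open(my $log_fh, '<', 'access-2004.log')
            or die "Unable to open access-2004.log: $!\n";

        while (my $line = <$log_fh>) {
            my ($key) = $line =~ /(\d{4}-\d{2})/ or next;
            $buffer{$key} .= $line;
            $pending      += length $line;
            flush_buffers() if $pending >= $max_bytes;
        }
        flush_buffers();                  # write out whatever remains
        close $_ for values %fh_for;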

    Cheers - L~R

Re: Opine on multiple open filehandles vs. multiple loops through data
by samtregar (Abbot) on Jan 22, 2006 at 15:57 UTC
    Your first option sounds optimal to me, although only benchmarking can prove it one way or another. 36 filehandles should be no problem and it shouldn't matter how much data you write to them.

    -sam