Plankton has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I hope you can help. I need a way faster way of creating files. Here's what I am doing: I am looking for duplicate entries in web server log files. These are very large files, so to find duplicates I open a log file, read a line from that log file, and create a string/filename based on the data on that line. If a file with this name already exists, I know I have found a duplicate log entry; otherwise I create a file with this name. I was doing ...
system( "echo $filename > lines/$filename" );
... this is way slow. I found that doing ...
open( FH, ">lines/$filename" ); print FH $filename; close FH;
... is faster. I was hoping that a Monk out there might even know a super-way-faster way of doing this; my boss is way impatient :( I hope I didn't use "way" way too much :) I got this surfer dude thing going on ... very strange.

Replies are listed 'Best First'.
Re: Super fast file creation needed
by Joost (Canon) on Oct 18, 2007 at 22:41 UTC
    That seems like a very roundabout way to check for duplicate entries, and also a very, very slow one (especially on filesystems whose operations get much slower as the number of files in a single directory grows, and many do).

    How about this construct: instead of testing for and creating files, put an entry in a hash and test for that (I'm reusing your $filename variable here, except I'm not using it as a filename):

    my %seen;    # declare this outside the loop

    # do the following for each $filename
    if ($seen{$filename}++) {
        # $filename is a duplicate
    }
    else {
        # never seen this before
    }
    Theoretically, that would run out of memory at some point, but I can't imagine a situation where your file-based strategy would be any better. If you've got a GB of RAM you should be able to store a couple of million "filenames" at least.

    updated: removed exists, added some comments

      If the memory requirements are too big, then the hash could be substituted with a DBM hash.
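      For example, a minimal sketch using DB_File (assuming Berkeley DB is available; the seen.db file name and the filter loop are just placeholders, and any other DBM module could be tied the same way):

      use strict;
      use DB_File;
      use Fcntl;

      # Tie the hash to an on-disk Berkeley DB file, so the "seen" data
      # lives on disk rather than in RAM.
      my %seen;
      tie %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
          or die "Cannot tie seen.db: $!";

      while (my $line = <>) {
          chomp $line;
          print "duplicate: $line\n" if $seen{$line}++;
      }

      untie %seen;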
      Whoa gnarly dude! That is totally way faster! Here's the details on what I am doing in case anyone spots something totally stupid on my part ...
      #!/usr/local/bin/perl -w
      use strict;

      my %seen;
      my @logfiles = glob( "*access_log" );

      foreach my $logfile (@logfiles) {
          open ( MYLOG, ">>progress_safe" );
          print MYLOG "Doing $logfile\n";
          close ( MYLOG );
          process_file( $logfile );
      }

      sub process_file {
          my $fn = shift;
          open ( FH, "<$fn" );
          while (<FH>) {
              chomp;
              s/\W/_/g;
              my $new_empty_file = substr( $_, 0, 200 );
              my $target = "$new_empty_file";
              if ( $seen{$target} && ($fn ne $seen{$target}) ) {
                  $seen{$target} = "$seen{$target} and $fn both have :$target\n";
                  open ( DUP, ">>dups_found_safe" );
                  print DUP "$seen{$target}";
                  close DUP;
              }
              else {
                  $seen{$target} = "$fn";
              }
          }
      }
      ... thanks Joost and Monks!

        Opening and closing your files for appending inside the loop is unnecessary. You won't get a huge speed boost by moving the opens out of the loops, but it will be an improvement.
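        Against the script above it might look roughly like this (a sketch only; process_file would then print to DUP without reopening it):

        # Open the output files once, before the main loop ...
        open ( MYLOG, ">>progress_safe" )   or die "progress_safe: $!";
        open ( DUP,   ">>dups_found_safe" ) or die "dups_found_safe: $!";

        foreach my $logfile (@logfiles) {
            print MYLOG "Doing $logfile\n";
            process_file( $logfile );
        }

        # ... and close them once at the end.  Note that the output is now
        # buffered, so enable autoflush on MYLOG if you want to watch the
        # progress file while the script is still running.
        close ( MYLOG );
        close ( DUP );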

Re: Super fast file creation needed
by thezip (Vicar) on Oct 18, 2007 at 22:49 UTC
    1. What constitutes a duplicate? It seems to me that each line would have to be unique due to the timestamps and chronological nature of log files. (Perhaps these are overlapping logfiles?)

    2. If the logfiles are overlapping, then instead of creating all the files, why not just calculate a checksum for each line and then stuff that into a hash? You could then check for the checksum's existence to see if that line has already been seen.
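
    For example, a rough sketch of the checksum idea (using Digest::MD5 purely as an illustration; the 16-byte binary digest keeps the hash keys short and fixed-length):

    use strict;
    use Digest::MD5 qw(md5);

    my %seen;
    while (my $line = <>) {
        # key on the 16-byte binary digest rather than the full log line
        my $key = md5($line);
        print "duplicate: $line" if $seen{$key}++;
    }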

    I apologize in advance if this is an oversimplification of the problem. Perhaps you can provide more detail?


    Where do you want *them* to go today?
      For #2, I don't really see the point of hashing the lines (with a checksum) and then storing them in a hash (which will re-hash the hashed values). The only reason I can think of would be an attempt to speed up lookups by using the checksum as a shorter hash key, but I would expect the extra time spent computing the checksums to overshadow any gains in lookup time. And then there's also the question of possible hash collisions in the checksums, which means more time wasted handling that redundantly (since Perl hashes already have collision handling for their hashed keys).

      Just using the log lines as your hash keys directly seems simpler, faster, and more reliable, unless I'm missing something here. Am I?

        I understood that the logfiles were huge, which, IMHO, makes storing the entire lines as hash keys impractical due to memory considerations.

        Sure, computing checksums/digests might slow things down some, but it is one way to identify whether a line has been seen or not. With the proper digest length, hash key collisions could be virtually eliminated.

        In this case, I think the memory considerations outweigh the speed considerations, but it would certainly be prudent to benchmark both ways to see which one works better.
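
        For what it's worth, a quick speed comparison could be sketched with the Benchmark module, something like this (the sample lines are made up; real numbers would depend on your actual logs):

        use strict;
        use Benchmark qw(cmpthese);
        use Digest::MD5 qw(md5);

        # Stand-in data; substitute a slice of a real access log.
        my @lines = map { qq{host$_ - - [18/Oct/2007] "GET /page/$_" 200\n} }
                    1 .. 10_000;

        cmpthese( -3, {
            full_line => sub {
                my %seen;
                $seen{$_}++ for @lines;
            },
            md5_key => sub {
                my %seen;
                $seen{ md5($_) }++ for @lines;
            },
        } );

        That only measures CPU time, of course; for the memory side you would have to compare the process size (or use something like Devel::Size on the two hashes) with realistic data.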


        Where do you want *them* to go today?