in reply to Super fast file creation needed

That seems like a very roundabout way to check for duplicate entries, and also a very slow one (especially on filesystems where operations get much slower as the number of files in a single directory increases, and many do).

How about this construct: instead of testing for and creating files, put an entry in a hash and test for that (I'm reusing your $filename variable here, except I'm not using it as a filename):

my %seen;    # declare this outside the loop

# do the following for each $filename
if ( $seen{$filename}++ ) {
    # $filename is a duplicate
}
else {
    # never seen this before
}
Theoretically, that would run out of memory at some point, but I can't imagine a situation where your file-based strategy would be any better. If you've got a GB of RAM you should be able to store at least a couple of million "filenames".

updated: removed exists, added some comments

Replies are listed 'Best First'.
Re^2: Super fast file creation needed
by ikegami (Patriarch) on Oct 18, 2007 at 23:40 UTC
    If the memory requirements are too big, then the hash could be substituted with a DBM hash.
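    A minimal sketch of that substitution, using the core SDBM_File module so the keys live on disk instead of in RAM (the filename 'seen.dbm' and the sample keys are made up for this example; DB_File or GDBM_File would work the same way via tie):

```perl
use strict;
use warnings;
use SDBM_File;
use Fcntl;

# Tie the hash to an on-disk DBM file; lookups and stores now go
# to disk rather than process memory.
tie my %seen, 'SDBM_File', 'seen.dbm', O_RDWR | O_CREAT, 0666
    or die "Couldn't tie SDBM file: $!";

# Same test-and-set idiom as before, now disk-backed:
for my $filename (qw(a.log b.log a.log)) {
    print "$filename is a duplicate\n" if $seen{$filename}++;
}

untie %seen;
```

    The only change to the original idiom is the tie; the `$seen{$filename}++` test is untouched.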
Re^2: Super fast file creation needed
by Plankton (Vicar) on Oct 18, 2007 at 23:30 UTC
    Whoa gnarly dude! That is totally way faster! Here's the details on what I am doing in case anyone spots something totally stupid on my part ...
    #!/usr/local/bin/perl -w
    use strict;

    my %seen;
    my @logfiles = glob( "*access_log" );

    foreach my $logfile (@logfiles) {
        open( MYLOG, ">>progress_safe" ) or die "Can't open progress_safe: $!";
        print MYLOG "Doing $logfile\n";
        close( MYLOG );
        process_file( $logfile );
    }

    sub process_file {
        my $fn = shift;
        open( FH, "<$fn" ) or die "Can't open $fn: $!";
        while (<FH>) {
            chomp;
            s/\W/_/g;
            my $new_empty_file = substr( $_, 0, 200 );
            my $target = $new_empty_file;
            if ( $seen{$target} && ( $fn ne $seen{$target} ) ) {
                $seen{$target} = "$seen{$target} and $fn both have :$target\n";
                open( DUP, ">>dups_found_safe" ) or die "Can't open dups_found_safe: $!";
                print DUP $seen{$target};
                close DUP;
            }
            else {
                $seen{$target} = $fn;
            }
        }
        close FH;
    }
    ... thanks Joost and Monks!

      Opening and closing your files for appending inside the loop is unnecessary. You won't get a huge speed boost by moving the opens out of the loops, but it will be an improvement.
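      A sketch of what that restructuring looks like, reusing the filenames from the post above and switching to lexical filehandles (the loop body is a stand-in; the real processing would go there):

```perl
use strict;
use warnings;

# Open each output file once, before the loops, instead of on
# every iteration.
open( my $mylog, '>>', 'progress_safe' )   or die "progress_safe: $!";
open( my $dup,   '>>', 'dups_found_safe' ) or die "dups_found_safe: $!";

for my $logfile ( glob '*access_log' ) {
    print $mylog "Doing $logfile\n";
    # ... process $logfile here, printing any duplicates to $dup ...
}

close $mylog;
close $dup;
```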