seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am simply opening one file, reading each line, and, according to the nature of the line, appending it to one of 2 or 3 files, thus 'sorting' the file into separate files.

I can either go through the original FILE 3 times and, on each pass, write only the lines of one particular nature to a file:

foreach my $n (@natures){
    open (OUT, "> $n.txt") or die "blah blah $! \n";
    seek FILE, 0, 0;    # rewind FILE for each pass
    while(<FILE>){
        print OUT $_ if $_ =~ /$n/;
    }
}
OR, I can go through FILE once and, at each line, check the nature and append to the appropriate file:
while(<FILE>){
    foreach my $n (@natures){
        if($_ =~ /$n/){
            open (OUT, ">> $n.txt") or die "blah blah $! \n";
            print OUT $_;
            close OUT;
        }
    }
}
I was wondering, seeing as FILE has thousands of lines, whether re-iterating through those thousands of lines would be any quicker than opening and closing several files once for every line...

Cheers
S

Replies are listed 'Best First'.
Re: Speed of opening and closing a file 1000s times
by davido (Cardinal) on Nov 03, 2003 at 16:52 UTC
    Could you not open the infile, and all of the possible outfiles before you begin iterating through the infile? Then you just write to whichever file handle you decide upon. And at the end of the entire process (which with just a thousand lines or so should be pretty quick) close all filehandles at once.

    Iterating through the same infile multiple times so as to be able to keep only one file opened at a time is a wasteful design, as is opening and immediately closing each outfile every time it is decided that something should be written to it. Open them all, do your work quickly, and then once you're done iterating, close them all.

    I wonder if your reason for wanting to open and immediately close each file was to circumvent the need for proper file locking. If so, that too is a seriously flawed design in an environment where the files may be needed by multiple processes.
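
    A minimal sketch of that approach (the input file name and the placeholder patterns in @natures are assumptions; the $n.txt output names follow the question):

    use strict;
    use warnings;

    my @natures = qw(foo bar baz);    # placeholder patterns

    open my $in, '<', 'input.txt' or die "open input.txt: $!";

    # Open every output file once, before touching the input.
    my %out;
    for my $n (@natures) {
        open $out{$n}, '>', "$n.txt" or die "open $n.txt: $!";
    }

    # Single pass over the input, printing to whichever handle matches.
    while (my $line = <$in>) {
        for my $n (@natures) {
            print { $out{$n} } $line if $line =~ /$n/;
        }
    }

    # Close everything once, at the end.
    close $out{$_} or die "close $_.txt: $!" for keys %out;
    close $in;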


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Speed of opening and closing a file 1000s times
by broquaint (Abbot) on Nov 03, 2003 at 16:54 UTC
    Just open all the files that need opening, then iterate through the file, e.g.:
    use IO::File;

    my %fhs = map {
        my $fh = IO::File->new($_, 'w')
            or die "ack [$_]: $!";
        $_ => $fh;
    } @natures;

    while (<FILE>) {
        for my $match (keys %fhs) {
            print {$fhs{$match}} $_ if /$match/;
        }
    }

    close $_ for values %fhs;
    So that will only open the files once and iterate through the file once, which should do what you want.
    HTH

    _________
    broquaint

      Perfect, thanks!

      S

Re: Speed of opening and closing a file 1000s times
by hanenkamp (Pilgrim) on Nov 03, 2003 at 17:00 UTC

    I would suggest neither solution. Why not simply keep all the files open? Or if there are many, you could keep a cache of open files which closes files based upon how long ago they were used (an LRU cache) or some other kind of heuristic that can determine the likelihood the file will be used again soon (such as frequency of use). For example:

    my %file_cache;

    sub open_file($) {
        my $filename = shift;
        unless ($file_cache{$filename}) {
            open $file_cache{$filename}, ">>$filename"
                or die "blah blah $!";
        }
        return $file_cache{$filename};
    }

    while (<FILE>) {
        foreach my $n (@natures) {
            if ($_ =~ /$n/) {
                my $out = open_file "$n.txt";
                print $out $_;
            }
        }
    }

    If there is the potential for many files, delete an entry from the cache managed by open_file when the given $filename isn't currently open and the cache is "full".

    This way you don't have to open files over and over again quite so often--we hope--and you don't have to perform reiterations.
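
    A minimal sketch of that size-bounded cache; the MAX_OPEN limit and the evict-the-oldest policy below are assumptions (a simplification of the LRU idea), not something from the post above:

    my %file_cache;     # filename => open handle
    my @open_order;     # filenames, oldest first
    my $MAX_OPEN = 20;  # assumed limit on simultaneously open handles

    sub open_file {
        my $filename = shift;
        return $file_cache{$filename} if $file_cache{$filename};

        # Cache is "full": close the least recently opened handle.
        if (@open_order >= $MAX_OPEN) {
            my $oldest = shift @open_order;
            close delete $file_cache{$oldest};
        }

        # Append mode, so a re-opened file keeps what was already written.
        open $file_cache{$filename}, '>>', $filename
            or die "open $filename: $!";
        push @open_order, $filename;
        return $file_cache{$filename};
    }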

Re: Speed of opening and closing a file 1000s times
by Abigail-II (Bishop) on Nov 03, 2003 at 17:34 UTC
    The following benchmark might be interesting:
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Fcntl     qw /:seek/;
    use Benchmark qw /timethese cmpthese/;

    my $lines = shift || 50_000;
    my $file  = "/tmp/big";

    # First create a big data file.
    open my $fh => "> $file" or die "open: $!";
    foreach (1 .. $lines) {
        foreach my $i (1 .. 3) {
            my $r = int rand 3;
            print $fh "abcdef" if $r == 0;
            print $fh "123456" if $r == 1;
            print $fh "<[{}]>" if $r == 2;
        }
        print $fh "\n";
    }
    close $fh or die "close: $!";

    my $words  = "/tmp/words";
    my $digits = "/tmp/digits";
    my $punct  = "/tmp/punct";

    my @array = ([$words  => '[a-z]'],
                 [$digits => '[0-9]'],
                 [$punct  => '[^a-z0-9]']);

    sub many {
        foreach my $entry (@array) {
            open my $fh => ">", $entry -> [0] . ".m" or die "open: $!";
            close $fh or die "close: $!";
        }
        open my $fh => $file or die "open: $!";
        while (<$fh>) {
            foreach my $entry (@array) {
                my ($f, $r) = @$entry;
                if (/$r/) {
                    open my $fh1 => ">> $f.m" or die "open: $!";
                    print $fh1 $_;
                    close $fh1 or die "close: $!";
                }
            }
        }
    }

    sub one {
        open my $fh => $file or die "open: $!";
        foreach my $entry (@array) {
            my ($f, $r) = @$entry;
            open my $fh1 => "> $f.o" or die "open: $!";
            seek $fh => 0, SEEK_SET or die "seek: $!";
            while (<$fh>) { print $fh1 $_ if /$r/ }
            close $fh1 or die "close: $!";
        }
    }

    cmpthese -60 => {
        one  => \&one,
        many => \&many,
    };

    unlink $file or warn $!;
    unlink map {my $s = $_ -> [0]; ("$s.m", "$s.o")} @array;

    __END__
           s/iter  many   one
    many     5.74    -- -95%
    one     0.267 2045%    --

    Going through the file repeatedly, and printing out the files one by one is a big winner.

    Abigail

      Some people hate this, but I like it and it works. Obviously limited by memory, but I've parsed 8 MB text files like this with no problem.

      This method may address issues if these are busy files. That is, leave the disk alone, do everything in RAM, then bother the disk when you're done.

      (For the purists, only the salient features are included, not die or flock etc.)

      # slurp the files into arrays
      open(FILE, 'filename_1'); @LINES_1 = <FILE>; close(FILE);
      open(FILE, 'filename_2'); @LINES_2 = <FILE>; close(FILE);
      open(FILE, 'filename_3'); @LINES_3 = <FILE>; close(FILE);

      # cycle through your file, pushing lines onto other files' arrays
      for $i (0 .. $#LINES_1) {
          if (condition_a) { push(@LINES_2, $LINES_1[$i]); }
          if (condition_b) { push(@LINES_3, $LINES_1[$i]); }
      }

      # write new arrays to files
      open(FILE, ">filename_2");
      for my $i (0 .. $#LINES_2) { print FILE "$LINES_2[$i]"; }
      close(FILE);

      open(FILE, ">filename_3");
      for my $i (0 .. $#LINES_3) { print FILE "$LINES_3[$i]"; }
      close(FILE);
      That interests me less than the effect of throwing in the following:
      sub multiplex {
          my @my_array;
          foreach my $entry (@array) {
              open my $fh => ">", $entry->[0].".mu" or die "open: $!";
              push @my_array, [$fh, $entry->[1]];
          }
          open my $fh => $file or die "open: $!";
          while (<$fh>) {
              foreach my $entry (@my_array) {
                  print {$entry->[0]} $_ if /$entry->[1]/;
              }
          }
          foreach my $entry (@my_array) {
              close($entry->[0]) or die "close: $!";
          }
      }
      After a couple of runs, it seems that the following is about what you get with multiple open filehandles that you juggle:
                 s/iter  multiplex   one
      multiplex    1.65         --  -87%
      one         0.209       690%    --
      which shows that not opening and closing the filehandles on every line is a win, but juggling them is still much slower than making repeated passes, and I have to wonder why. (One thing that I am wondering is how much file caching at the OS level could make a difference.)

      For the record, my trial was done on Debian with Perl 5.8.1.

Re: Speed of opening and closing a file 1000s times
by Abigail-II (Bishop) on Nov 03, 2003 at 16:59 UTC
    The first code fragment (which I think belongs to your second alternative) should be faster. Not only because you do less in the inner loop, but you allow far more filesystem caching *and* perl doesn't need to compile a new regex at each inner iteration.
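
    For the single-pass variant, a similar saving can be had by compiling each pattern once with qr//. A minimal, self-contained sketch (the placeholder patterns and the STDIN input are assumptions):

    use strict;
    use warnings;

    my @natures = qw(foo bar baz);    # placeholder patterns

    # Compile each pattern once, up front, instead of interpolating
    # $n into a fresh match on every line of input.
    my %re = map { $_ => qr/$_/ } @natures;

    while (my $line = <STDIN>) {
        for my $n (@natures) {
            print "matched $n: $line" if $line =~ $re{$n};
        }
    }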

    Abigail

Re: Speed of opening and closing a file 1000s times
by Nkuvu (Priest) on Nov 03, 2003 at 17:02 UTC

    Update: Hrm, I guess I'm a very slow typer today.

    Opening and closing files once for every line? Why in the world would you want to do that? Consider the following:

    #!/usr/bin/perl -w
    use strict;

    my $line;
    open INPUT_FILE,  "input_filename" or die "Wonky!\n";
    open FIRST_FILE,  ">filename1"     or die "Hork!\n";
    open SECOND_FILE, ">filename2"     or die "Foo!\n";
    open THIRD_FILE,  ">filename3"     or die "Spew!\n";

    while ($line = <INPUT_FILE>) {
        if ($line =~ /first_criteria/) {
            print FIRST_FILE $line;
        }
        elsif ($line =~ /second_criteria/) {
            print SECOND_FILE $line;
        }
        # And so on.
    }

    But I have no idea where the @natures array comes from. Is that your criteria for sorting the lines?

Re: Speed of opening and closing a file 1000s times
by zengargoyle (Deacon) on Nov 03, 2003 at 20:15 UTC

    see also: FileCache which can simplify handling multiple files.
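
    A minimal sketch of what that might look like (the maxopen value, the placeholder patterns, and the STDIN input are assumptions; cacheout and its returned handle are taken from the module's own synopsis):

    use strict;
    use warnings;
    use FileCache maxopen => 16;      # cap on simultaneously open handles

    my @natures = qw(foo bar baz);    # placeholder patterns

    while (my $line = <STDIN>) {
        for my $n (@natures) {
            next unless $line =~ /$n/;
            # cacheout opens the file on first use ('>') and reopens it
            # for append ('>>') if it had to be closed in the meantime.
            my $fh = cacheout "$n.txt";
            print $fh $line;
        }
    }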

Re: Speed of opening and closing a file 1000s times
by ysth (Canon) on Nov 03, 2003 at 20:25 UTC
    In your case, I would second the advice to open all the output files up front and write to whichever is appropriate, with the caveat that you try not to recompile the regex for each possibility (i.e. store the regexen as qr// objects). In the case of larger numbers of output files, I would create a structure something like:
    @natures = (
        { filename  => "h.ll",
          criterion => qr/blasphemer/,
          handle    => undef,   # this line not actually needed
        },
        ...
    and open each handle as needed, with a loop at the end over natures to close them all.
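
    A minimal sketch of that arrangement (the input file name and the single example entry are assumptions; the field names follow the fragment above):

    my @natures = (
        { filename  => "h.ll",
          criterion => qr/blasphemer/,
          handle    => undef,
        },
        # ... more entries ...
    );

    open my $in, '<', 'input.txt' or die "open input.txt: $!";

    while (my $line = <$in>) {
        for my $n (@natures) {
            next unless $line =~ $n->{criterion};
            # Open each output file only the first time it is needed.
            unless ($n->{handle}) {
                open $n->{handle}, '>', $n->{filename}
                    or die "open $n->{filename}: $!";
            }
            print { $n->{handle} } $line;
        }
    }

    # Loop over natures at the end, closing whatever was opened.
    for my $n (@natures) {
        close $n->{handle} or die "close $n->{filename}: $!"
            if $n->{handle};
    }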

    Hey, wait, you could even have some kind of object encapsulating that hash, and let an object method do the match. But that starts to sound like Mail::ListDelivery/ Mail::Audit::List.