in reply to Speed of opening and closing a file 1000s times

The following benchmark might be interesting:
#!/usr/bin/perl

use strict;
use warnings;
use Fcntl     qw /:seek/;
use Benchmark qw /timethese cmpthese/;

my $lines = shift || 50_000;
my $file  = "/tmp/big";

# First create a big data file.
open my $fh => "> $file" or die "open: $!";
foreach (1 .. $lines) {
    foreach my $i (1 .. 3) {
        my $r = int rand 3;
        print $fh "abcdef" if $r == 0;
        print $fh "123456" if $r == 1;
        print $fh "<[{}]>" if $r == 2;
    }
    print $fh "\n";
}
close $fh or die "close: $!";

my $words  = "/tmp/words";
my $digits = "/tmp/digits";
my $punct  = "/tmp/punct";

my @array = ([$words  => '[a-z]'],
             [$digits => '[0-9]'],
             [$punct  => '[^a-z0-9]']);

sub many {
    foreach my $entry (@array) {
        open my $fh => ">", $entry -> [0] . ".m" or die "open: $!";
        close $fh or die "close: $!";
    }
    open my $fh => $file or die "open: $!";
    while (<$fh>) {
        foreach my $entry (@array) {
            my ($f, $r) = @$entry;
            if (/$r/) {
                open my $fh1 => ">> $f.m" or die "open: $!";
                print $fh1 $_;
                close $fh1 or die "close: $!";
            }
        }
    }
}

sub one {
    open my $fh => $file or die "open: $!";
    foreach my $entry (@array) {
        my ($f, $r) = @$entry;
        open my $fh1 => "> $f.o" or die "open: $!";
        seek $fh => 0, SEEK_SET or die "seek: $!";
        while (<$fh>) {
            print $fh1 $_ if /$r/;
        }
        close $fh1 or die "close: $!";
    }
}

cmpthese -60 => {
    one  => \&one,
    many => \&many,
};

unlink $file or warn $!;
unlink map {my $s = $_ -> [0]; ("$s.m", "$s.o")} @array;

__END__
       s/iter  many   one
many     5.74    -- -95%
one     0.267 2045%   --

Going through the file repeatedly and printing out the files one by one is a big winner.

Abigail

Re: Re: Speed of opening and closing a file 1000s times
by punchcard_don (Beadle) on Nov 03, 2003 at 20:15 UTC
    Some people hate this, but I like it and it works. Obviously it's limited by memory, but I've parsed 8 MB text files like this with no problem.

    This method may also help if these are busy files: leave the disk alone, do everything in RAM, and only bother the disk once you're done.

    (For the purists, only the salient features are included, not die or flock etc.)

    # slurp the files into arrays
    open(FILE, 'filename_1'); @LINES_1 = <FILE>; close(FILE);
    open(FILE, 'filename_2'); @LINES_2 = <FILE>; close(FILE);
    open(FILE, 'filename_3'); @LINES_3 = <FILE>; close(FILE);

    # cycle through your file, pushing lines onto the other files' arrays
    for my $i (0 .. $#LINES_1) {
        if (condition_a) { push(@LINES_2, $LINES_1[$i]); }
        if (condition_b) { push(@LINES_3, $LINES_1[$i]); }
    }

    # write the new arrays back out to their files
    open(FILE, ">filename_2");
    for my $i (0 .. $#LINES_2) { print FILE $LINES_2[$i]; }
    close(FILE);

    open(FILE, ">filename_3");
    for my $i (0 .. $#LINES_3) { print FILE $LINES_3[$i]; }
    close(FILE);
Re: Re: Speed of opening and closing a file 1000s times
by tilly (Archbishop) on Nov 04, 2003 at 02:57 UTC
    That interests me less than the effect of throwing in the following:
    sub multiplex {
        my @my_array;
        foreach my $entry (@array) {
            open my $fh => ">", $entry->[0] . ".mu" or die "open: $!";
            push @my_array, [$fh, $entry->[1]];
        }
        open my $fh => $file or die "open: $!";
        while (<$fh>) {
            foreach my $entry (@my_array) {
                print {$entry->[0]} $_ if /$entry->[1]/;
            }
        }
        foreach my $entry (@my_array) {
            close($entry->[0]) or die "close: $!";
        }
    }
    After a couple of runs, this is roughly what you get when you juggle multiple open filehandles:
               s/iter multiplex   one
    multiplex    1.65        -- -87%
    one         0.209      690%    --
    which shows that not having to open and close the filehandles repeatedly is a win over many, but juggling the handles is still much slower than making repeated passes, and I have to wonder why. (One thing that I am wondering is how much caching of the file at the OS level could make a difference.)

    For the record, my trial was done on Debian with Perl 5.8.1.
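    A further variant that might help pin this down is a single pass that buffers the matching lines per output file in memory and writes each file once at the end, so there is neither repeated open/close nor interleaved small writes. This is only a sketch, not part of the benchmark above; the sub name buffered and the ".b" suffix are made up for illustration, and it assumes the same @array and $file as Abigail's script.

    sub buffered {
        # Accumulate matching lines in memory, keyed by output file name.
        my %buffer = map { $_->[0] . ".b" => "" } @array;
        open my $fh => $file or die "open: $!";
        while (<$fh>) {
            foreach my $entry (@array) {
                my ($f, $r) = @$entry;
                $buffer{"$f.b"} .= $_ if /$r/;
            }
        }
        # One open, one big print, and one close per output file.
        foreach my $f (sort keys %buffer) {
            open my $out => ">", $f or die "open: $!";
            print $out $buffer{$f};
            close $out or die "close: $!";
        }
    }

    If this variant comes out close to one, the cost in multiplex is probably the interleaving of small writes across several handles rather than anything about the handles themselves; if it stays closer to multiplex, the per-line regex work or memory pressure would be the more likely suspect.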