seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I am simply opening one file, reading each line, and, according to the nature of the line, appending it to one of 2 or 3 files, thus 'sorting' the file into separate files.

I can either go through the original FILE 3 times and, on each pass, write only the lines of one particular nature to a file:

foreach my $n (@natures){
    open (OUT, "> $n.txt") or die "blah blah $! \n";
    seek FILE, 0, 0;    # rewind FILE for each pass
    while(<FILE>){
        print OUT $_ if $_ =~ /$n/;
    }
}
OR, I can go through FILE once and, at each line, check the nature and append to the appropriate file:
while(<FILE>){
    foreach my $n (@natures){
        if($_ =~ /$n/){
            open (OUT, ">> $n.txt") or die "blah blah $! \n";
            print OUT $_;
            close OUT;
        }
    }
}
I was wondering, seeing as FILE has thousands of lines, whether re-iterating through those thousands of lines would be any quicker than opening and closing several files once for every line...

Cheers
S

Replies are listed 'Best First'.
Re: Speed of opening and closing a file 1000s times
by davido (Cardinal) on Nov 03, 2003 at 16:52 UTC
    Could you not open the infile, and all of the possible outfiles before you begin iterating through the infile? Then you just write to whichever file handle you decide upon. And at the end of the entire process (which with just a thousand lines or so should be pretty quick) close all filehandles at once.

    Iterating through the same infile multiple times so as to be able to keep only one file opened at a time is a wasteful design, as is opening and immediately closing each outfile every time it is decided that something should be written to it. Open them all, do your work quickly, and then once you're done iterating, close them all.

    I wonder if your reason for wanting to open and immediately close each file was to circumvent the need for proper file locking. If so, that too is a seriously flawed design in an environment where the files may be needed by multiple processes.
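
    A minimal sketch of that approach (the input file name and the placeholder patterns in @natures are assumptions; the $n.txt output names follow the question):

    use strict;
    use warnings;

    my @natures = qw(foo bar baz);    # placeholder patterns

    open my $in, '<', 'input.txt' or die "open input.txt: $!";

    # Open every output file once, before touching the input.
    my %out;
    for my $n (@natures) {
        open $out{$n}, '>', "$n.txt" or die "open $n.txt: $!";
    }

    # Single pass over the input, printing to whichever handle matches.
    while (my $line = <$in>) {
        for my $n (@natures) {
            print { $out{$n} } $line if $line =~ /$n/;
        }
    }

    # Close everything once, at the end.
    close $out{$_} or die "close $_.txt: $!" for keys %out;
    close $in;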


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Speed of opening and closing a file 1000s times
by broquaint (Abbot) on Nov 03, 2003 at 16:54 UTC
    Just open all the files that need opening, then iterate through the file, e.g.:
    use IO::File;

    my %fhs = map {
        my $fh = IO::File->new($_, 'w')
            or die "ack [$_]: $!";
        $_ => $fh;
    } @natures;

    while (<FILE>) {
        for my $match (keys %fhs) {
            print {$fhs{$match}} $_ if /$match/;
        }
    }

    close $_ for values %fhs;
    So that will only open the files once and iterate through the file once, which should do what you want.
    HTH

    _________
    broquaint

      Perfect, thanks!

      S

Re: Speed of opening and closing a file 1000s times
by hanenkamp (Pilgrim) on Nov 03, 2003 at 17:00 UTC

    I would suggest neither solution. Why not simply keep all the files open? Or if there are many, you could keep a cache of open files which closes files based upon how long ago they were used (an LRU cache) or some other kind of heuristic that can determine the likelihood the file will be used again soon (such as frequency of use). For example:

    my %file_cache;

    sub open_file($) {
        my $filename = shift;
        unless ($file_cache{$filename}) {
            open $file_cache{$filename}, ">>$filename"
                or die "blah blah $!";
        }
        return $file_cache{$filename};
    }

    while (<FILE>) {
        foreach my $n (@natures) {
            if ($_ =~ /$n/) {
                my $out = open_file "$n.txt";
                print $out $_;
            }
        }
    }

    If there is the potential for many files, delete an entry from the cache managed by open_file when the given $filename isn't currently open and the cache is "full".

    This way you don't have to open files over and over again quite so often--we hope--and you don't have to perform reiterations.
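
    A minimal sketch of that size-bounded cache; the MAX_OPEN limit and the evict-the-oldest policy below are assumptions (a simplification of the LRU idea), not something from the post above:

    my %file_cache;     # filename => open handle
    my @open_order;     # filenames, oldest first
    my $MAX_OPEN = 20;  # assumed limit on simultaneously open handles

    sub open_file {
        my $filename = shift;
        return $file_cache{$filename} if $file_cache{$filename};

        # Cache is "full": close the least recently opened handle.
        if (@open_order >= $MAX_OPEN) {
            my $oldest = shift @open_order;
            close delete $file_cache{$oldest};
        }

        # Append mode, so a re-opened file keeps what was already written.
        open $file_cache{$filename}, '>>', $filename
            or die "open $filename: $!";
        push @open_order, $filename;
        return $file_cache{$filename};
    }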

Re: Speed of opening and closing a file 1000s times
by Abigail-II (Bishop) on Nov 03, 2003 at 17:34 UTC
    The following benchmark might be interesting:
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Fcntl     qw /:seek/;
    use Benchmark qw /timethese cmpthese/;

    my $lines = shift || 50_000;
    my $file  = "/tmp/big";

    # First create a big data file.
    open my $fh => "> $file" or die "open: $!";
    foreach (1 .. $lines) {
        foreach my $i (1 .. 3) {
            my $r = int rand 3;
            print $fh "abcdef" if $r == 0;
            print $fh "123456" if $r == 1;
            print $fh "<[{}]>" if $r == 2;
        }
        print $fh "\n";
    }
    close $fh or die "close: $!";

    my $words  = "/tmp/words";
    my $digits = "/tmp/digits";
    my $punct  = "/tmp/punct";

    my @array = ([$words  => '[a-z]'],
                 [$digits => '[0-9]'],
                 [$punct  => '[^a-z0-9]']);

    sub many {
        foreach my $entry (@array) {
            open my $fh => ">", $entry -> [0] . ".m" or die "open: $!";
            close $fh or die "close: $!";
        }
        open my $fh => $file or die "open: $!";
        while (<$fh>) {
            foreach my $entry (@array) {
                my ($f, $r) = @$entry;
                if (/$r/) {
                    open my $fh1 => ">> $f.m" or die "open: $!";
                    print $fh1 $_;
                    close $fh1 or die "close: $!";
                }
            }
        }
    }

    sub one {
        open my $fh => $file or die "open: $!";
        foreach my $entry (@array) {
            my ($f, $r) = @$entry;
            open my $fh1 => "> $f.o" or die "open: $!";
            seek $fh => 0, SEEK_SET or die "seek: $!";
            while (<$fh>) { print $fh1 $_ if /$r/ }
            close $fh1 or die "close: $!";
        }
    }

    cmpthese -60 => {
        one  => \&one,
        many => \&many,
    };

    unlink $file or warn $!;
    unlink map {my $s = $_ -> [0]; ("$s.m", "$s.o")} @array;

    __END__
           s/iter  many   one
    many     5.74    -- -95%
    one     0.267 2045%    --

    Going through the file repeatedly, and printing out the files one by one is a big winner.

    Abigail

      Some people hate this, but I like it and it works. Obviously limited by memory, but I've parsed 8 MB text files like this with no problem.

      This method may address issues if these are busy files. That is, leave the disk alone, do everything in RAM, then bother the disk when you're done.

      (For the purists, only the salient features are included, not die or flock etc.)

      # slurp the files into arrays
      open(FILE, 'filename_1'); @LINES_1 = <FILE>; close(FILE);
      open(FILE, 'filename_2'); @LINES_2 = <FILE>; close(FILE);
      open(FILE, 'filename_3'); @LINES_3 = <FILE>; close(FILE);

      # cycle through your file, pushing lines onto other files' arrays
      for $i (0 .. $#LINES_1) {
          if (condition_a) { push(@LINES_2, $LINES_1[$i]); }
          if (condition_b) { push(@LINES_3, $LINES_1[$i]); }
      }

      # write new arrays to files
      open(FILE, ">filename_2");
      for my $i (0 .. $#LINES_2) { print FILE "$LINES_2[$i]"; }
      close(FILE);

      open(FILE, ">filename_3");
      for my $i (0 .. $#LINES_3) { print FILE "$LINES_3[$i]"; }
      close(FILE);
      That interests me less than the effect of throwing in the following:
      sub multiplex {
          my @my_array;
          foreach my $entry (@array) {
              open my $fh => ">", $entry->[0].".mu" or die "open: $!";
              push @my_array, [$fh, $entry->[1]];
          }
          open my $fh => $file or die "open: $!";
          while (<$fh>) {
              foreach my $entry (@my_array) {
                  print {$entry->[0]} $_ if /$entry->[1]/;
              }
          }
          foreach my $entry (@my_array) {
              close($entry->[0]) or die "close: $!";
          }
      }
      After a couple of runs, it seems that the following is about what you get with multiple open filehandles that you juggle:
                 s/iter  multiplex   one
      multiplex    1.65         --  -87%
      one         0.209       690%    --
      which shows that not opening and closing the filehandles on every line is a win, but juggling them is still much slower than making repeated passes, and I have to wonder why. (One thing that I am wondering is how much file caching at the OS level could make a difference.)

      For the record, my trial was done on Debian with Perl 5.8.1.

Re: Speed of opening and closing a file 1000s times
by Abigail-II (Bishop) on Nov 03, 2003 at 16:59 UTC
    The first code fragment (which I think belongs to your second alternative) should be faster. Not only because you do less in the inner loop, but you allow far more filesystem caching *and* perl doesn't need to compile a new regex at each inner iteration.
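
    For the single-pass variant, a similar saving can be had by compiling each pattern once with qr//. A minimal, self-contained sketch (the placeholder patterns and the STDIN input are assumptions):

    use strict;
    use warnings;

    my @natures = qw(foo bar baz);    # placeholder patterns

    # Compile each pattern once, up front, instead of interpolating
    # $n into a fresh match on every line of input.
    my %re = map { $_ => qr/$_/ } @natures;

    while (my $line = <STDIN>) {
        for my $n (@natures) {
            print "matched $n: $line" if $line =~ $re{$n};
        }
    }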

    Abigail

Re: Speed of opening and closing a file 1000s times
by Nkuvu (Priest) on Nov 03, 2003 at 17:02 UTC

    Update: Hrm, I guess I'm a very slow typer today.

    Opening and closing files once for every line? Why in the world would you want to do that? Consider the following:

    #!/usr/bin/perl -w
    use strict;

    my $line;
    open INPUT_FILE,  "input_filename" or die "Wonky!\n";
    open FIRST_FILE,  ">filename1"     or die "Hork!\n";
    open SECOND_FILE, ">filename2"     or die "Foo!\n";
    open THIRD_FILE,  ">filename3"     or die "Spew!\n";

    while ($line = <INPUT_FILE>) {
        if ($line =~ /first_criteria/) {
            print FIRST_FILE $line;
        }
        elsif ($line =~ /second_criteria/) {
            print SECOND_FILE $line;
        }
        # And so on.
    }

    But I have no idea where the @natures array comes from. Is that your criteria for sorting the lines?

Re: Speed of opening and closing a file 1000s times
by zengargoyle (Deacon) on Nov 03, 2003 at 20:15 UTC

    see also: FileCache which can simplify handling multiple files.
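
    A minimal sketch of what that might look like (the maxopen value, the placeholder patterns, and the STDIN input are assumptions; cacheout and its returned handle are taken from the module's own synopsis):

    use strict;
    use warnings;
    use FileCache maxopen => 16;      # cap on simultaneously open handles

    my @natures = qw(foo bar baz);    # placeholder patterns

    while (my $line = <STDIN>) {
        for my $n (@natures) {
            next unless $line =~ /$n/;
            # cacheout opens the file on first use ('>') and reopens it
            # for append ('>>') if it had to be closed in the meantime.
            my $fh = cacheout "$n.txt";
            print $fh $line;
        }
    }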

Re: Speed of opening and closing a file 1000s times
by ysth (Canon) on Nov 03, 2003 at 20:25 UTC
    In your case, I would second the advice to open all the output files up front and write to whichever is appropriate, with the caveat that you try not to recompile the regex for each possibility (i.e. store the regexen as qr// objects). In the case of larger numbers of output files, I would create a structure something like:
    @natures = (
        { filename  => "h.ll",
          criterion => qr/blasphemer/,
          handle    => undef,   # this line not actually needed
        },
        ...
    and open each handle as needed, with a loop at the end over natures to close them all.
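
    A minimal sketch of that arrangement (the input file name and the single example entry are assumptions; the field names follow the fragment above):

    my @natures = (
        { filename  => "h.ll",
          criterion => qr/blasphemer/,
          handle    => undef,
        },
        # ... more entries ...
    );

    open my $in, '<', 'input.txt' or die "open input.txt: $!";

    while (my $line = <$in>) {
        for my $n (@natures) {
            next unless $line =~ $n->{criterion};
            # Open each output file only the first time it is needed.
            unless ($n->{handle}) {
                open $n->{handle}, '>', $n->{filename}
                    or die "open $n->{filename}: $!";
            }
            print { $n->{handle} } $line;
        }
    }

    # Loop over natures at the end, closing whatever was opened.
    for my $n (@natures) {
        close $n->{handle} or die "close $n->{filename}: $!"
            if $n->{handle};
    }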

    Hey, wait, you could even have some kind of object encapsulating that hash, and let an object method do the match. But that starts to sound like Mail::ListDelivery/ Mail::Audit::List.