in reply to Re: Write large array to file, very slow
in thread Write large array to file, very slow

The ">>" mode was used precisely because the file was constantly being reopened. But with your ++proposition the file is only opened once, so the preceding unlink can be removed and ">" can simply overwrite it instead.
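Something like this, roughly (my reconstruction of the before/after, not the original code verbatim):

my $output_file = "mergedlogs.txt";

# Before (as I read the original): wipe the file once, then reopen it in append mode for every write
# unlink $output_file;
# open my $output_fh, ">>", $output_file or die "Can't open $output_file: $!";

# After: a single open in ">" mode truncates the file itself, so the unlink is no longer needed
open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";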

Besides, the 3-arg version of open with a scalar filehandle can be used for many reasons (elegance, safety ...), but at the very least it keeps things consistent with the way the input files are opened.

my $output_file = "mergedlogs.txt";
open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
{
    local $| = 0;
    local $\ = "\n";         # Automatically append \n
    foreach (@mergedlogs) {
        print $output_fh $_; # "$_\n" copies $_ into a new string before appending \n
    }
}
close $output_fh;
Although Corion's proposition to write the result straight to the file, without the @mergedlogs intermediate variable, is probably a good idea as well.
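For what it's worth, a rough sketch of that variant under my assumptions (next_merged_line() is a hypothetical stand-in for whatever produces the merged lines, since I don't have that part of the code):

my $output_file = "mergedlogs.txt";
open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
{
    local $\ = "\n";                                   # still append the newline automatically
    while (defined( my $line = next_merged_line() )) { # hypothetical source of merged lines
        print $output_fh $line;                        # write each line as it is produced, no @mergedlogs array
    }
}
close $output_fh;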

Re^3: Write large array to file, very slow
by hippo (Archbishop) on Aug 20, 2018 at 15:24 UTC

    Just ran a quick bench and your suggestion to avoid the copy (++) works very well - it's about twice as fast as my code above. As another test I also tried local $, = "\n"; print $output_fh @mergedlogs; but that's no faster to within statistical noise. I'll run a longer bench later just to see if it's at all significant.

      "Twice as fast" seems like a lot for in memory operations when there are also disk accesses. I'm sure there are plenty of things to consider (hardware, data size, ...), but with the following code I couldn't get past a difference of around ~5% (although I did notice that trying it with stupidly big files made my computer crash :P):

      use v5.20;
      use strict;
      use warnings;
      use Benchmark qw( cmpthese );
      use Data::Dump qw( pp );

      my $size   = 10;
      my $length = 1E6;
      my @data   = ('X' x $length, ) x $size;

      sub write_copy {
          open my $fh, ">", "tmp.txt" or die "Can't open output file $!";
          $| = 0;
          my $data = shift;
          for (@$data) {
              print $fh "$_\n";
          }
      }

      sub write_simple {
          local $\ = "\n";
          open my $fh, ">", "tmp.txt" or die "Can't open output file $!";
          $| = 0;
          my $data = shift;
          for (@$data) {
              print $fh $_;
          }
      }

      cmpthese( -15, {
          copy   => sub { write_copy(\@data); },
          simple => sub { write_simple(\@data); },
      } );

      __END__
                Rate   copy simple
      copy    27.3/s     --    -5%
      simple  28.8/s     5%     --

        "Twice as fast" seems like a lot for in memory operations when there are also disk accesses.

        Yes, I thought so too. Looks like my data set was so large it ate into swap. :-)

        Re-running with a smaller data set still shows quite a decent speed up, however. Here's my bench and results:

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use Benchmark 'cmpthese';

        my $size = 50_000_000;
        my @big  = (rand () x $size);

        cmpthese (10, {
            'interp' => 'interp ()',
            'Eily'   => 'eily ()',
            'OFS'    => 'ofs ()',
        });
        exit;

        sub interp {
            open FH, '>', 'mergedlogs.txt' or die "can't open mergedlogs.txt: $!";
            local $| = 0;
            foreach (@big) {
                print FH "$_\n";
            }
            close FH;
        }

        sub eily {
            my $output_file = "mergedlogs.txt";
            open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
            local $| = 0;
            local $\ = "\n";
            foreach (@big) {
                print $output_fh $_;
            }
            close $output_fh;
        }

        sub ofs {
            my $output_file = "mergedlogs.txt";
            open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
            local $| = 0;
            local $\ = "\n";
            local $, = "\n";
            print $output_fh @big;
            close $output_fh;
        }
               s/iter interp  Eily   OFS
        interp   1.83     --  -35%  -35%
        Eily     1.20    53%    --   -1%
        OFS      1.19    54%    1%    --