in reply to best way to fast write to large number of files

Depending on your operating system (and, potentially, your network setup), opening and closing a file are relatively expensive operations. As you seem to have all the data in memory already, it might be faster to group the data by customer and then print it out in one go for each customer:

my %files;
foreach my $CDR (@RECList) {
    my ($filename, $row) = split(/,/, $CDR);
    $files{ $filename } ||= [];             # start with an empty array
    push @{ $files{ $filename } }, $row;    # append the row to that array
}

# Now print out the data, one open+close per customer
for my $filename (sort keys %files) {
    open my $csv_fh, '>>', "/ClientRrecord/$filename.csv"
        or die "couldn't open [$filename.csv]\n" . $!;
    print { $csv_fh } map { "$_\n" } @{ $files{ $filename } };  # print all rows, with newlines
    close $csv_fh;
}

Re^2: Delay when write to large number of file
by thargas (Deacon) on Jun 24, 2014 at 11:36 UTC

    Since open/close is too slow, perhaps you might try a database, like sqlite?

    I made a small test program and the output seems to indicate that it would be significantly faster:

    C:\> perl trymany.pl
    connecting to dbi:SQLite:db.sqlite3 ...
    connected to dbi:SQLite:db.sqlite3
    ready to begin
                    Rate openclose  sqlite
    openclose     2580/s        --    -98%
    sqlite      121065/s     4593%      --

    The code is:

    #!/usr/bin/perl
    # trymany - compare open/write/close with db access
    #vim: syntax=perl
    use v5.14;
    use warnings;
    use Benchmark qw( :all );
    use File::Path qw( make_path );
    use DBI;

    sub make_file_name {
        #FUNCTION $top -> $name
        my ($top) = @_;
        my $f = $top . '/' . sprintf "%04d", rand(10000);
        return $f;
    }

    sub make_dir {
        #FUNCTION $dir -> $dir
        my ($dir) = @_;
        return $dir if (-d $dir);
        my $ok = make_path($dir)
            or die "cannot mkdir $dir: $!\n";
        return $dir;
    }

    my $db_file      = "db.sqlite3";
    my $dsn          = "dbi:SQLite:$db_file";
    my $table        = "testtab";
    my $column       = "data";
    my $insert_sql   = "insert into $table ($column) values (?)";
    my $create_sql   = "create table $table ($column varchar)";
    my $commit_every = 1000;
    my $uncommitted  = 0;
    my $record       = 'x' x 80;
    my $create_table = (-f $db_file) ? 0 : 1;
    my $top          = "dirs";

    make_dir($top) or die "cannot mkdir $top $!\n";

    my $n    = (shift @ARGV) || 100000;
    my $seed = (shift @ARGV) || 12523477;
    srand($seed);

    warn "connecting to $dsn ...\n";
    my $dbh = DBI->connect($dsn, '', '', {
        AutoCommit => 0,
        PrintError => 1,
        RaiseError => 1,
    }) or die "cannot connect to $dsn: $!\n";
    warn "connected to $dsn\n";

    if ($create_table) {
        $dbh->do($create_sql) or die "cannot create table $DBI::errstr\n";
        warn "created table\n";
    }

    my $sth = $dbh->prepare($insert_sql) or die "cannot prepare: $DBI::errstr\n";

    my $first_sqlite = 1;
    warn "ready to begin\n";

    cmpthese( $n, {
        openclose => sub {
            state $dir = make_dir("$top/openclose");
            my $f = make_file_name($dir);
            open my $fh, ">>$f" or die "cannot open $f for append; $!\n";
            defined(print $fh $record) or die "cannot write $f: $!\n";
            close($fh) or die "cannot close $f: $!\n";
        },
        sqlite => sub {
            $sth->execute($record) or die "cannot insert: $DBI::errstr\n";
            ++$uncommitted;
            if ($uncommitted >= $commit_every) {
                $dbh->commit or die "cannot commit $DBI::errstr\n";
                $uncommitted = 0;
            }
        },
    });

      It seems to me that the point of the OP's program is to create reports for different customers. I'm not sure how creating one database file with all the customer data will help them.

        It won't work if he insists on processing all the files each time the report is requested, but I doubt that anything will. I assumed, perhaps incorrectly, that collecting the data by customer was a background process.

        I figured that with all the data in the database, making the reports would be easy and fast, assuming that you indexed the table properly. It's possible this doesn't scale either, but the test program can easily be tweaked to tell whether it will.
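
        For illustration, here is a rough sketch of what that reporting side might look like. The cdr table, its customer/record columns, and the index name are hypothetical (the test table above only has a single data column):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical schema: one row per CDR, keyed by a customer column.
        my $dbh = DBI->connect("dbi:SQLite:db.sqlite3", '', '',
            { RaiseError => 1, AutoCommit => 1 });

        # An index on the customer column is what makes per-customer reports fast.
        $dbh->do("CREATE INDEX IF NOT EXISTS idx_customer ON cdr (customer)");

        my $sth = $dbh->prepare(
            "SELECT record FROM cdr WHERE customer = ? ORDER BY rowid");
        $sth->execute('sam');                   # one query per requested report
        while (my ($record) = $sth->fetchrow_array) {
            print "$record\n";                  # or append to sam.csv, etc.
        }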

Re^2: Delay when write to large number of file
by Hosen1989 (Scribe) on Jun 23, 2014 at 14:57 UTC

    Dear Corion, thanks for the reply.

    It's a good idea if we have a low number of clients.

    The problem is that we have more than 5M subscribers (with more than 2.7M active clients), and we have a continuous input of logs (about 4~5 files every minute), with every log file having ~18K rows.

    So it's very rare for a client to have more than 3 records in the same log file, which means it makes little difference whether we sort the data or not.

    -- but it would be a good idea if we added more than one log file together into that list @RECList.

    -- now, another question: how many rows can we add to the array? Or what is the size limit of an array in Perl?

    Also, I know the operating system is partly at fault here (Windows 7 & Windows Server 2008); we really wish we could use Linux but cannot :(

    And I know the slowness comes from opening and closing too many files; I tried to find another way to write to the files but couldn't find one (I'm a beginner in Perl ~ but I like it very much ^_^).

    BR

    Hosen

      As it currently is, you are doing 18k open+close per file. Open and close are slow on Windows. With my approach, you will reduce the number of open+close. If you process more than one file before writing the output, you can reduce the number of open+close per client even more.
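
      A rough sketch of that batching idea (the log directory is a placeholder; the /ClientRrecord/ path is taken from your script):

      use strict;
      use warnings;

      my @log_files = glob("/path/to/logs/*.log");   # placeholder location
      my %rows_for;                                  # client name => array of rows

      # Read several log files before writing anything out
      for my $log (@log_files) {
          open my $in, '<', $log or die "couldn't open [$log]: $!";
          while (my $line = <$in>) {
              chomp $line;
              my ($client, $row) = split /,/, $line, 2;
              push @{ $rows_for{$client} }, $row;
          }
          close $in;
      }

      # One open+close per client for the whole batch
      for my $client (sort keys %rows_for) {
          open my $out, '>>', "/ClientRrecord/$client.csv"
              or die "couldn't open [$client.csv]: $!";
          print {$out} map { "$_\n" } @{ $rows_for{$client} };
          close $out;
      }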

      Perl has no limit for the array size other than available memory.
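
      If you want a rough feel for how much memory such an array costs, the CPAN module Devel::Size can report it; the record shape below is made up:

      use strict;
      use warnings;
      use Devel::Size qw(total_size);   # CPAN module, install separately

      # Build 100k fake CDR lines and measure the whole structure
      my @RECList = map { "client$_,some,record,data" } 1 .. 100_000;
      printf "100k records use roughly %.1f MB\n",
          total_size(\@RECList) / (1024 * 1024);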

      This thought may be way off topic here, but if you can't change the OS, and for some reason can't employ Corion's solution, perhaps you might consider the hardware itself? There are some SSDs (solid-state drives) which reportedly increase performance 100x over spinning drives. If you were to carefully select an SSD that has been proven in testing to perform well with your flavor of OS, you might see some gain that way. If I understand your numbers right, even a 10x performance increase would help you out.

      just a thought...

      (honestly, I think the solution that Corion referred to is the way to go)


      What corion said :)

      On my old 2006 laptop with a 3500rpm hard disk ... processing/printing 18k records with a single open/close consistently takes under four seconds.

      Doing an extra 18k open/close, it takes twice as long or longer (7-27 seconds).
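
      A rough sketch of how such a comparison could be timed (record shape, output directories, and count are placeholders):

      use strict;
      use warnings;
      use Benchmark qw(timethese);
      use File::Path qw(make_path);

      make_path("out_single", "out_many");

      # 18k fake records spread over ~100 clients
      my @records = map { sprintf("client%04d,some record data", rand(100)) } 1 .. 18_000;

      # One iteration each is enough for a rough wall-clock comparison
      timethese(1, {
          single_open => sub {
              open my $fh, '>>', "out_single/all.csv" or die $!;
              print {$fh} "$_\n" for @records;
              close $fh;
          },
          open_per_record => sub {
              for my $rec (@records) {
                  my ($client) = split /,/, $rec;
                  open my $fh, '>>', "out_many/$client.csv" or die $!;
                  print {$fh} "$rec\n";
                  close $fh;
              }
          },
      });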

        Dear Friend,

        Are you saying that the following script will finish in less than 30 seconds (with 18K opens/closes)?

        @RECList = .....
        clientname,record
        sam,plaplapla
        jame,bobobo
        kate,sososo
        .....

        print "FLASH A-LIST\n";
        foreach my $CDR (@RECList) {
            my ($filename, $row) = split(/,/, $CDR);
            open(my $csv_fh, ">> /ClientRrecord/$filename.csv")
                or die "couldn't open [$filename.csv]\n" . $!;
            print { $csv_fh } $row . "\n";
        }