Re: Delay when write to large number of file
by Corion (Patriarch) on Jun 23, 2014 at 08:15 UTC
Depending on your operating system (and potentially, network setup), opening and closing a file are relatively expensive operations. As you seem to have all the data in memory already, it might be faster to sort the data according to each customer and then print it out in one go for each customer:
my %files;
foreach my $CDR (@RECList) {
    my ($filename, $row) = split /,/, $CDR;
    $files{ $filename } ||= [];              # start with an empty array
    push @{ $files{ $filename } }, $row;     # append the row to that array
}

# Now print out the data, opening each client file only once
for my $filename (sort keys %files) {
    open my $csv_fh, '>>', "/ClientRrecord/$filename.csv"
        or die "couldn't open [$filename.csv]\n$!";
    print { $csv_fh } map { "$_\n" } @{ $files{ $filename } };   # print all rows (with newlines)
    close $csv_fh or die "couldn't close [$filename.csv]\n$!";
}
Since open/close is too slow, perhaps you might try a database, like sqlite?
I made a small test program and the output seems to indicate that it would be significantly faster:
C:\> perl trymany.pl
connecting to dbi:SQLite:db.sqlite3 ...
connected to dbi:SQLite:db.sqlite3
ready to begin
                Rate openclose  sqlite
openclose     2580/s        --    -98%
sqlite      121065/s     4593%      --
The code is:
#!/usr/bin/perl
# trymany - compare open/write/close with db access
# vim: syntax=perl
use v5.14;
use warnings;

use Benchmark qw( :all );
use File::Path qw( make_path );
use DBI;

sub make_file_name { #FUNCTION $top -> $name
    my ($top) = @_;
    my $f = $top . '/' . sprintf "%04d", rand(10000);
    return $f;
}

sub make_dir { #FUNCTION $dir -> $dir
    my ($dir) = @_;
    return $dir if -d $dir;
    my $ok = make_path($dir) or die "cannot mkdir $dir: $!\n";
    return $dir;
}

my $db_file      = "db.sqlite3";
my $dsn          = "dbi:SQLite:$db_file";
my $table        = "testtab";
my $column       = "data";
my $insert_sql   = "insert into $table ($column) values (?)";
my $create_sql   = "create table $table ($column varchar)";
my $commit_every = 1000;
my $uncommitted  = 0;
my $record       = 'x' x 80;
my $create_table = (-f $db_file) ? 0 : 1;

my $top = "dirs";
make_dir($top) or die "cannot mkdir $top: $!\n";

my $n    = (shift @ARGV) || 100000;
my $seed = (shift @ARGV) || 12523477;
srand($seed);

warn "connecting to $dsn ...\n";
my $dbh = DBI->connect($dsn, '', '', {
    AutoCommit => 0,
    PrintError => 1,
    RaiseError => 1,
}) or die "cannot connect to $dsn: $!\n";
warn "connected to $dsn\n";

if ($create_table) {
    $dbh->do($create_sql) or die "cannot create table: $DBI::errstr\n";
    warn "created table\n";
}
my $sth = $dbh->prepare($insert_sql) or die "cannot prepare: $DBI::errstr\n";

my $first_sqlite = 1;
warn "ready to begin\n";
cmpthese( $n, {
    openclose => sub {
        state $dir = make_dir("$top/openclose");
        my $f = make_file_name($dir);
        open my $fh, '>>', $f or die "cannot open $f for append: $!\n";
        defined(print $fh $record) or die "cannot write $f: $!\n";
        close($fh) or die "cannot close $f: $!\n";
    },
    sqlite => sub {
        $sth->execute($record) or die "cannot insert: $DBI::errstr\n";
        ++$uncommitted;
        if ($uncommitted >= $commit_every) {
            $dbh->commit or die "cannot commit: $DBI::errstr\n";
            $uncommitted = 0;
        }
    },
});
Dear Corion, thanks for the reply.
It's a good idea if we have a low number of clients.
The problem is that we have more than 5M subscribers (with more than 2.7M active clients), and we have a continuous input of logs (about 4~5 files every minute), with every log file having ~18K rows.
So it's very rare for a client to have more than 3 records in the same log file, which means it makes little difference whether we sort the data or not.
-- But it would be a good idea if we added more than one log file together into that list @RECList.
-- Now, another question: how many rows can we add to the array? Or what is the size limit of an array in Perl?
Also, I know the operating system has some part in this (Windows 7 & Windows Server 2008); we really wish we could use Linux but cannot :(
And I know the slowness comes from opening and closing too many files. I tried to find another way to write to the files, but couldn't find any (I'm a beginner in Perl ~ but I like it very much ^_^).
BR
Hosen
As it currently is, you are doing 18k open+close operations per log file (one per record), and open and close are slow on Windows. With my approach, you reduce that to one open+close per client per log file. If you process more than one log file before writing the output, you can reduce the number of open+close operations per client even further; a sketch follows below.
Perl has no limit on array size other than the available memory.
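A rough sketch of that batching idea (the logs/*.log location and the "filename,rest-of-record" layout are assumptions for illustration; the /ClientRrecord/ path is taken from the code above):
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: accumulate rows from several log files per client,
# then open each client file once per batch.
my @log_files = glob 'logs/*.log';     # assumed location of the input logs

my %rows_for;                          # client file name => array of rows
for my $log (@log_files) {
    open my $in, '<', $log or die "cannot read $log: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($filename, $row) = split /,/, $line, 2;   # assumed "client,rest-of-record" layout
        push @{ $rows_for{$filename} }, $row;
    }
    close $in;
}

# One open/append/close per client for the whole batch of log files
for my $filename (sort keys %rows_for) {
    open my $out, '>>', "/ClientRrecord/$filename.csv"
        or die "cannot append to $filename.csv: $!";
    print {$out} map { "$_\n" } @{ $rows_for{$filename} };
    close $out or die "cannot close $filename.csv: $!";
}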
This thought may be way off topic here, but if you can't change OS, and for some reason can't employ Corion's solution, perhaps you might consider the hardware itself? There are some SSDs (solid-state drives) which reportedly increase performance 100x over spinning drives. If you were to carefully select an SSD that has been test-proven to perform with your flavor of OS, you might see some gain that way. If I understand your numbers right, even a 10x performance increase would help you out.
just a thought...
(honestly, I think the solution that Corion referred to is the way to go)
...the majority is always wrong, and always the last to know about it...
Insanity: Doing the same thing over and over again and expecting different results...
A solution is nothing more than a clearly stated problem... otherwise, the problem is not a problem, it is a fact
What Corion said :)
On my old 2006 laptop with a 3500rpm hard disk ... processing/printing 18k records with a single open/close takes under four seconds consistently.
Doing an extra 18k open/close operations, it takes twice as long or longer (7-27 seconds).
Re: best way to fast write to large number of files
by Laurent_R (Canon) on Jun 23, 2014 at 19:10 UTC
I do not know enough about your requirements to figure out whether my suggested ideas make sense. The question is: do you need your client files to be updated every minute, or even every 10 minutes, or even every hour? Probably not; I would suspect you need to process the log files quite often, but not necessarily update your client files that often.
Based on these assumptions, I can think of two general types of solution.
One is to read the log files, store the daily activity in a database, and write the database content out to the client files once per day (or pick any other time interval that better suits your needs). The advantage is that the overhead of opening so many files occurs only once per day.
Another idea is to pseudo-hash your client logs into temporary files. For example, you could store in one file all logs concerning clients whose customer number ends with 00, in another file the logs pertaining to clients whose customer number ends with 01, and so on up to 99. Each time you read a log (and assuming you sort the records by the last two digits of the customer number), you then only need to open 100 files for writing, which means much less overhead than 18K files. Then, once per day (or whatever schedule fits your needs better), you process these temporary files to put the records into the final client files. I am fairly sure that such a mechanism would give you a huge gain; a sketch of the bucketing step follows below.
Of course, 100 temporary files processed once per day are just numbers I picked because they made some sense to me. You may want to change both to something else if it makes more sense in your case: it could be once per hour, and it could be more or fewer temporary files. You have to figure out the best combination based on your knowledge of the situation and actual tests on the data.
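A minimal sketch of that bucketing step (the field layout, the buckets/ staging directory, and reading log names from the command line are assumptions, not the actual setup):
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: route each log record into one of 100 bucket files,
# keyed on the last two digits of the customer number.
my $bucket_dir = 'buckets';            # assumed staging directory
mkdir $bucket_dir unless -d $bucket_dir;

my %bucket_fh;                         # "00" .. "99" => open file handle
for my $log (@ARGV) {
    open my $in, '<', $log or die "cannot read $log: $!";
    while (my $line = <$in>) {
        my ($customer) = split /,/, $line;      # assumed: customer number is the first field
        my $bucket = substr $customer, -2;      # last two digits pick the bucket
        $bucket_fh{$bucket} //= do {
            open my $fh, '>>', "$bucket_dir/$bucket.tmp"
                or die "cannot append to $bucket_dir/$bucket.tmp: $!";
            $fh;
        };
        print { $bucket_fh{$bucket} } $line;
    }
    close $in;
}
close $_ for values %bucket_fh;        # only 100 opens/closes, however many logs were read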
Dear Laurent_R,
Thanks for your reply, it has very interesting ideas (I read your reply more than 5 times ^_^). And yes, what you suggested is true.
What's in my mind now (going with the first idea) is to load the logs into a DB (we will use MySQL - thanks sundialsvc4 for the idea about using a DB), then run some GROUP BY queries and write the results to the specific files; this will reduce the number of opens and closes.
I think with the right schedule we can handle all the files without any delay.
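A rough sketch of that flow (the database, table, column and connection details are placeholders, not our real schema; it uses ORDER BY rather than GROUP BY so that every row is preserved):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch only: pull each client's staged rows back out of MySQL, sorted by
# client, so every client file is opened and closed just once.
my $dbh = DBI->connect('dbi:mysql:database=cdr_staging', 'user', 'pass',
                       { RaiseError => 1 })
    or die "cannot connect: $DBI::errstr";

# Assumed staging table: cdr_log(client VARCHAR(64), row_data TEXT)
my $sth = $dbh->prepare('SELECT client, row_data FROM cdr_log ORDER BY client');
$sth->execute;

my ($current, $fh);
while (my ($client, $row) = $sth->fetchrow_array) {
    if (!defined $current or $client ne $current) {
        close $fh if $fh;
        open $fh, '>>', "/ClientRrecord/$client.csv"
            or die "cannot append to $client.csv: $!";
        $current = $client;
    }
    print {$fh} "$row\n";
}
close $fh if $fh;
$dbh->disconnect;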
And we will try the second idea too, because it also offers a good approach to resolving the issue.
We will compare the two ideas and of course choose the best ;)
I will update shortly.
BR
Hosen
Hi Hosen,
I suspect that the second solution will be significantly faster, because your overall process (each record is written once and read once) only marginally benefits from the advantages of a database, while appending data to 100 files is very fast. But I'll be very interested to read your update on this.
Re: best way to fast write to large number of files
by locked_user sundialsvc4 (Abbot) on Jun 23, 2014 at 18:48 UTC
Well, it may be too much of a design alternative to consider, but this might be a fine application for an SQLite database file (or files) in which $filename is an indexed column. This would, in effect, push the cataloging chore off of the filesystem and onto the indexing capabilities of this (very high-performance) database engine. In this case, I think it's a possibility well worth considering.
Note: one vital caveat is that transactions must be used when writing to SQLite3 databases, since otherwise every disk-write is physically verified.
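A minimal sketch of that layout (table and column names are illustrative only; records are assumed to arrive as "filename,row" lines on STDIN):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch only: one table keyed by client file name, with an index on that
# column, and the inserts wrapped in a single transaction.
my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite3', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS cdr (filename TEXT, row_data TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS idx_cdr_filename ON cdr (filename)');

my $ins = $dbh->prepare('INSERT INTO cdr (filename, row_data) VALUES (?, ?)');

chomp(my @RECList = <STDIN>);           # assumed: one "filename,row" record per line
$dbh->begin_work;                       # one transaction for the whole batch
for my $cdr (@RECList) {
    my ($filename, $row) = split /,/, $cdr, 2;
    $ins->execute($filename, $row);
}
$dbh->commit;

# Later, all rows for one client come back with a single indexed lookup:
# SELECT row_data FROM cdr WHERE filename = ?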
Note: one vital caveat is that transactions must be used when writing to SQLite3 databases, since otherwise every disk-write is physically verified.
So you're saying
if I don't use transactions, SQLite physically verifies every disk write,
and if I use transactions, SQLite doesn't physically verify every disk write?
When does SQLite ever "physically verify" a write? In case you're talking about fsync(2) (which only flushes writes to the disk), transactions are only indirectly involved. In fact, every change to an SQLite database happens in a transaction, whether explicitly or implicitly. fsync() behavior is actually controlled via PRAGMA synchronous and influenced by PRAGMA journal_mode (see e.g. WAL).
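For reference, a small sketch of how those pragmas are set from DBI (the values shown are just one common combination, not a recommendation):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch only: fsync behaviour is steered by PRAGMA synchronous and
# PRAGMA journal_mode, independently of whether transactions are explicit.
my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite3', '', '',
                       { RaiseError => 1 });

$dbh->do('PRAGMA journal_mode = WAL');     # write-ahead logging
$dbh->do('PRAGMA synchronous  = NORMAL');  # fewer fsyncs than the default FULL

# Every change still runs inside a transaction, explicit or implicit:
$dbh->begin_work;
$dbh->do('CREATE TABLE IF NOT EXISTS t (x TEXT)');
$dbh->do(q{INSERT INTO t (x) VALUES ('hello')});
$dbh->commit;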