Hosen1989 has asked for the wisdom of the Perl Monks concerning the following question:

I have a task which is: read a live feed from the system, which consists of client log files; every file has about 18K records. We do some processing on each record and then append every record to a file specific to its client, so that each client ends up with a file containing every record that client has made.

Our program can handle the first part, which is reading the input file and converting it to a 2xN array, where the first column is the client name and the second is the record.

The problem is: when the program starts writing the records to the client-specific files, there is a really large delay in processing, taking more than 2 minutes for each input file.

@RECList = .....
    clientname,record
    sam,plaplapla
    jame,bobobo
    kate,sososo
.....

print "FLASH A-LIST\n";
foreach my $CDR (@RECList){
    my ($filename,$row) = split(/,/, $CDR);
    open(my $csv_fh, ">> /ClientRrecord/$filename.csv")
        or die "couldn't open [$filename.csv]\n".$!;
    print { $csv_fh } $row."\n";
}

Re: Delay when write to large number of file
by Corion (Patriarch) on Jun 23, 2014 at 08:15 UTC

    Depending on your operating system (and potentially, network setup), opening and closing a file are relatively expensive operations. As you seem to have all the data in memory already, it might be faster to sort the data according to each customer and then print it out in one go for each customer:

    my %files;
    foreach my $CDR (@RECList) {
        my ($filename,$row) = split(/,/, $CDR);
        $files{ $filename } ||= [];               # start with empty array
        push @{ $files{ $filename } }, $row;      # append row to that array
    };

    # Now print out the data
    for my $filename (sort keys %files) {
        open my $csv_fh, '>>', "/ClientRrecord/$filename.csv"
            or die "couldn't open [$filename.csv]\n".$!;
        print { $csv_fh } map { "$_\n" } @{ $files{ $filename } };   # print out all lines (with newlines)
    };

      Since open/close is too slow, perhaps you might try a database, like sqlite?

      I made a small test program and the output seems to indicate that it would be significantly faster:

      C:\> perl trymany.pl
      connecting to dbi:SQLite:db.sqlite3 ...
      connected to dbi:SQLite:db.sqlite3
      ready to begin
                     Rate openclose    sqlite
      openclose    2580/s        --      -98%
      sqlite     121065/s     4593%        --

      The code is:

      #!/usr/bin/perl
      # trymany - compare open/write/close with db access
      #vim: syntax=perl

      use v5.14;
      use warnings;

      use Benchmark qw( :all );
      use File::Path qw( make_path );
      use DBI;

      sub make_file_name {    #FUNCTION $top -> $name
          my ($top) = @_;
          my $f = $top . '/' . sprintf "%04d", rand(10000);
          return $f;
      }

      sub make_dir {          #FUNCTION $dir -> $dir
          my ($dir) = @_;
          return $dir if (-d $dir);
          my $ok = make_path($dir) or die "cannot mkdir $dir: $!\n";
          return $dir;
      }

      my $db_file      = "db.sqlite3";
      my $dsn          = "dbi:SQLite:$db_file";
      my $table        = "testtab";
      my $column       = "data";
      my $insert_sql   = "insert into $table ($column) values (?)";
      my $create_sql   = "create table $table ($column varchar)";
      my $commit_every = 1000;
      my $uncommitted  = 0;
      my $record       = 'x' x 80;
      my $create_table = (-f $db_file) ? 0 : 1;
      my $top          = "dirs";

      make_dir($top) or die "cannot mkdir $top $!\n";

      my $n    = (shift @ARGV) || 100000;
      my $seed = (shift @ARGV) || 12523477;
      srand($seed);

      warn "connecting to $dsn ...\n";
      my $dbh = DBI->connect($dsn, '', '', {
          AutoCommit => 0,
          PrintError => 1,
          RaiseError => 1,
      }) or die "cannot connect to $dsn: $!\n";
      warn "connected to $dsn\n";

      if ($create_table) {
          $dbh->do($create_sql) or die "cannot create table $DBI::errstr\n";
          warn "created table\n";
      }
      my $sth = $dbh->prepare($insert_sql) or die "cannot prepare: $DBI::errstr\n";

      my $first_sqlite = 1;
      warn "ready to begin\n";
      cmpthese( $n, {
          openclose => sub {
              state $dir = make_dir("$top/openclose");
              my $f = make_file_name($dir);
              open my $fh, ">>$f" or die "cannot open $f for append; $!\n";
              defined(print $fh $record) or die "cannot write $f: $!\n";
              close($fh) or die "cannot close $f: $!\n";
          },
          sqlite => sub {
              $sth->execute($record) or die "cannot insert: $DBI::errstr\n";
              ++$uncommitted;
              if ($uncommitted >= $commit_every) {
                  $dbh->commit or die "cannot commit $DBI::errstr\n";
                  $uncommitted = 0;
              }
          },
      });

        It seems to me that the point of the OP's program is to create reports for different customers. I'm not sure how creating one database file with all the customer data will help them.

      Dear Corion, thanks for the reply.

      It's a good idea if we have a low number of clients.

      The problem is that we have more than 5M subscribers (with more than 2.7M active clients), and we have a continuous input of logs (about 4~5 files every minute), with every log file holding ~18K rows.

      So it's very rare for a client to have made more than 3 records in the same log file, which means sorting the data makes little difference here.

      -- But it would be a good idea if we combined more than one log file into that list @RECList.

      -- Now, another question: how many rows can we add to the array? Or what is the size limit of an array in Perl?

      Also, I know the operating system is partly at fault here (Windows 7 & Windows Server 2008); we really wish to use Linux but cannot :(

      And I know the slowness comes from opening and closing too many files. I tried to find some other way to write to the files but couldn't find any (I'm a beginner in Perl ~ but I like it very much ^_^).

      BR

      Hosen

        As it currently is, you are doing 18k open+close per file. Open and close are slow on Windows. With my approach, you will reduce the number of open+close. If you process more than one file before writing the output, you can reduce the number of open+close per client even more.

        Perl has no limit for the array size other than available memory.
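
        For illustration, here is a minimal sketch of that multi-file batching: accumulate several log files in one hash, then open each client file only once per batch. The /LiveFeed glob, the output directory, and the comma-separated "client,record" layout are assumptions for the sketch, not a tested implementation.

        use strict;
        use warnings;

        # Sketch only: the /LiveFeed path and one-batch-per-glob policy are assumptions.
        my @log_files = glob("/LiveFeed/*.log");   # several input log files per batch
        my %rows_for;                              # client name => list of records

        for my $log (@log_files) {
            open my $in, '<', $log or die "couldn't open [$log]: $!";
            while (my $line = <$in>) {
                chomp $line;
                my ($client, $record) = split /,/, $line, 2;   # limit 2 keeps commas inside records
                push @{ $rows_for{$client} }, $record;
            }
            close $in;
        }

        # One open/close per client per batch instead of one per record
        for my $client (sort keys %rows_for) {
            open my $out, '>>', "/ClientRrecord/$client.csv"
                or die "couldn't open [$client.csv]: $!";
            print {$out} map { "$_\n" } @{ $rows_for{$client} };
            close $out or die "couldn't close [$client.csv]: $!";
        }

        The more log files you fold into one batch, the fewer open/close calls each client costs, at the price of holding that batch in memory.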

        This thought may be way off topic here, but if you can't change the OS, and for some reason can't employ Corion's solution, perhaps you might consider the hardware itself? There are some SSDs (Solid State Drives) which reportedly increase performance 100x over spinning drives. If one were to carefully select an SSD that has been test-proven to perform with your flavor of OS, you might see some gain that way. If I understand your numbers right, even a 10x performance increase would help you out.

        just a thought...

        (honestly, I think the solution that Corion referred to is the way to go)

        ...the majority is always wrong, and always the last to know about it...

        Insanity: Doing the same thing over and over again and expecting different results...

        A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a fact

        What Corion said :)

        On my old 2006 laptop with 3500rpm hard disk ... processing/printing 18k records with a single open/close takes under four seconds consistently.

        Doing an extra 18k open/close operations, it takes twice as long or longer (7-27 seconds).

Re: best way to fast write to large number of files
by Laurent_R (Canon) on Jun 23, 2014 at 19:10 UTC
    I do not know enough about your requirements to figure out whether my suggested ideas make sense. The question is: do you need your client files to be updated every minute, or even every 10 minutes, or even every hour? Probably not; I would suspect you need to process the log files quite often, but not necessarily update your client files so often.

    Based on these assumptions, I can think of two general types of solution.

    One is to read the log files, store the daily activity in a database, and dump the database content into the client files once per day (or pick any other time interval better suiting your needs). The advantage is that the overhead of opening so many files occurs only once per day.

    Another idea is to pseudo-hash your client logs into temporary files. For example, you could store in one file all logs concerning clients whose customer number ends with 00, in another file the logs pertaining to clients whose customer number ends with 01, and so on up to 99. That way, each time you read a log file (assuming you sort its records by the last two digits of the customer number), you only need to open 100 files for writing, which means much less overhead than opening 18K files. Then, once per day (or on whatever schedule fits your needs better), you process these temporary files to put the records into the final client files. I am fairly sure that using such a mechanism would give you a huge gain.

    Of course, the 100 temporary files and the once-per-day processing are just numbers that I picked because they made some sense to me. You may want to change both to something else if it makes more sense in your case: it could be once per hour, and it could be more or fewer temporary files. You have to figure out the best combination based on your knowledge of the situation and actual tests on the data.
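
    As an illustration of the bucketing idea, here is a sketch of the routing step only; it assumes numeric customer ids, an invented /Buckets directory, and "client,record" lines arriving on standard input.

    use strict;
    use warnings;

    my %bucket_fh;   # "00" .. "99" => filehandle, kept open for the whole run

    # Sketch only: derive the bucket from the last two digits of the client id.
    sub bucket_for {
        my ($client) = @_;
        my ($digits) = $client =~ /(\d{2})$/;
        return defined $digits ? $digits : '00';   # fallback for non-numeric ids
    }

    while (my $line = <STDIN>) {               # one "client,record" line per log entry
        chomp $line;
        my ($client, $record) = split /,/, $line, 2;
        my $bucket = bucket_for($client);
        $bucket_fh{$bucket} //= do {
            open my $fh, '>>', "/Buckets/$bucket.tmp"
                or die "couldn't open bucket $bucket: $!";
            $fh;
        };
        print { $bucket_fh{$bucket} } "$client,$record\n";
    }
    close $_ for values %bucket_fh;

    # A separate end-of-day pass would read each bucket file and append its
    # records to the per-client files, touching each client file only once.

    Only about 100 handles stay open here, well within normal per-process limits, so no file is opened more than once per run.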

      Dear Laurent_R,

      Thanks for your reply, it has very interesting ideas (I read it more than 5 times ^_^). And yes, what you suggested is true.

      What's in my mind now (going with the first idea) is to load the logs into a DB (we will use MySQL - thanks sundialsvc4 for the idea of using a DB), then run some group-by query and write the results to the specific client files; this will reduce the number of opens and closes.
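
      A rough sketch of that plan follows; the cdr table, its client and record columns, and the connection details are placeholders, and it uses ORDER BY rather than GROUP BY so that every record is kept while each client file still gets opened exactly once.

      use strict;
      use warnings;
      use DBI;

      # Sketch only: table "cdr" with "client" and "record" columns is assumed.
      my $dbh = DBI->connect('dbi:mysql:database=cdrlog', 'user', 'password',
                             { RaiseError => 1 });

      my $sth = $dbh->prepare('SELECT client, record FROM cdr ORDER BY client');
      $sth->execute;

      my ($current, $fh);
      while (my ($client, $record) = $sth->fetchrow_array) {
          if (!defined $current or $client ne $current) {
              close $fh if $fh;                     # finished the previous client
              open $fh, '>>', "/ClientRrecord/$client.csv"
                  or die "couldn't open [$client.csv]: $!";
              $current = $client;
          }
          print {$fh} "$record\n";
      }
      close $fh if $fh;
      $dbh->disconnect;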

      I think with the right schedule we can handle all the files without any delay.

      And we will try the second idea too, because it is also a good approach to resolving the issue.

      We will compare the two ideas and of course choose the best ;)

      I will update shortly.

      BR

      Hosen

        Hi Hosen, I suspect that the second solution will be significantly faster, because your overall process (each record is written once and read once) only marginally benefits from the advantages of a database, while appending data to 100 files is very fast. But I'll be very interested to read your update on this.
Re: best way to fast write to large number of files
by sundialsvc4 (Abbot) on Jun 23, 2014 at 18:48 UTC

    Well, it may be too much of a design alternative to consider, but this might be a fine application for an SQLite database file (or files) in which $filename is an indexed column. This would, in effect, push the cataloging chore off of the filesystem and onto the indexing capabilities of this (very high-performance) database engine. In this case, I think it's a possibility well worth considering.

    Note: one vital caveat is that transactions must be used when writing to SQLite3 databases, since otherwise every disk-write is physically verified.
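
    For example, the inserts could be batched inside an explicit transaction; in this sketch the table layout, index name, and one-transaction-per-log-file boundary are invented for illustration.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite3', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS cdr (client TEXT, record TEXT)');
    $dbh->do('CREATE INDEX IF NOT EXISTS idx_cdr_client ON cdr (client)');

    my $ins = $dbh->prepare('INSERT INTO cdr (client, record) VALUES (?, ?)');

    $dbh->begin_work;                      # one transaction per input log file
    while (my $line = <STDIN>) {           # "client,record" lines from one log file
        chomp $line;
        my ($client, $record) = split /,/, $line, 2;
        $ins->execute($client, $record);
    }
    $dbh->commit;

    # Per-client retrieval then uses the index instead of the filesystem:
    #   SELECT record FROM cdr WHERE client = ?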

      Note: one vital caveat is that transactions must be used when writing to SQLite3 databases, since otherwise every disk-write is physically verified.

      So you're saying if I don't use transactions, SQLite physically verifies every disk write, and if I use transactions, SQLite doesn't physically verify every disk write?

      When does SQLite ever "physically verify" a write? In case you're talking about fsync(2) (which only flushes writes to the disk), transactions are only indirectly involved. In fact, every change to an SQLite database happens in a transaction, whether explicitly or implicitly. fsync() behavior is actually controlled via PRAGMA synchronous and influenced by PRAGMA journal_mode (see e.g. WAL).
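
      For completeness, a small sketch of setting those pragmas through DBD::SQLite; the specific values shown are just one possible durability/speed trade-off, not a recommendation.

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite3', '', '',
                             { RaiseError => 1 });

      $dbh->do('PRAGMA journal_mode = WAL');    # write-ahead log journal
      $dbh->do('PRAGMA synchronous = NORMAL');  # fewer fsync calls than FULL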