Earlier this year, I received enlightenment on "Better way to work with large TSV files?" -- now I am revisiting the problem for a related application.
I have a collection of large files -- 1.6M rows each, on average -- in CSV format. I need to insert them into a database. Unlike my earlier problem, though, the network link is not the weak point this time: the database is on the same machine my application will run on. My "weak links" are disk speed and database insert speed. (Also, I don't need to "pop" anything anymore.) My current approach looks roughly like this:
    open FH, '<', "$filename.csv" or die("Can't open $filename.csv");
    $dbh->begin_work;
    my $sth = $dbh->prepare(
        "INSERT INTO T_$filename (" . join(', ', @column) . ") VALUES ("
        . join(',', map { '?' } @column) . ")"
    );
    my $c_size = 500;    # size of buffer chunk
    my @buffer;
    while (<FH>) {
        chomp;                    # strip the newline so the last column stays clean
        my @row = split(',');     # No quoting issues in these files, yay!
        push @buffer, \@row;
        next unless (@buffer >= $c_size or eof(FH));
        while (@buffer) {
            my $row = shift @buffer;
            $sth->execute(@$row);
        }
    }
    close FH;
    $dbh->commit;
In short, I read chunks of 500 rows and then insert them into the database. It works. But I had an idea of how to do it better; the problem is, I don't know where to begin in order to implement it. Basically, I want the reading and the inserting to overlap: one part of the program keeps filling a buffer with rows from the file while another part simultaneously drains that buffer into the database, instead of alternating between filling and draining as I do now.
I'm thinking threads are the answer, but even after reading the manuals I'm not sure I understand enough about threading to be sure this is the right approach. And, I don't know how to even begin to search CPAN for existing "wheels" that do things this way.
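To make the idea concrete, here is an untested sketch of the shape I have in mind, using the core threads and Thread::Queue modules. The DSN, table name, and column list below are placeholders, and the writer thread opens its own database handle because DBI handles can't be shared across threads:

    # Untested sketch -- DSN, table name, and column list are placeholders.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use DBI;

    my @column = qw(col_a col_b col_c);    # placeholder column list
    my $queue  = Thread::Queue->new();

    # Writer thread: connects on its own (DBI handles can't be shared
    # across threads), then drains the queue until it sees undef.
    my $writer = threads->create(sub {
        my $dbh = DBI->connect('dbi:SQLite:dbname=example.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        my $sth = $dbh->prepare(
            'INSERT INTO T_example (' . join(', ', @column)
            . ') VALUES (' . join(',', map { '?' } @column) . ')'
        );
        while (defined(my $line = $queue->dequeue())) {
            $sth->execute(split(',', $line));
        }
        $dbh->commit;
        $dbh->disconnect;
    });

    # Reader (main thread): feed raw lines to the queue as fast as the
    # disk allows; the writer inserts concurrently.
    open my $fh, '<', 'example.csv' or die "Can't open example.csv: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $queue->enqueue($line);
    }
    close $fh;

    $queue->enqueue(undef);    # signal end-of-data
    $writer->join();

The reader just pushes raw lines onto the queue and the writer does the split and execute, so the file read and the inserts can, in theory, proceed in parallel -- whether that actually buys anything here is exactly what I'm unsure about.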
Any ideas or pointers on how to begin, or for approaches that might be even better? I humbly await enlightenment from ye noble Monks.
radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}