
Earlier this year, I received enlightenment in "Better way to work with large TSV files?". Now I am revisiting the problem for a related application.

I have a collection of large files (1.6M rows, on average) in CSV format. I need to insert them into a database. Unlike my earlier problem, though, the network link is not the weakest link: the database is on the same machine my application will run on. My "weak links" are disk speed and database insert speed. (Also, I don't need to "pop" anything anymore.) My current approach looks like this (roughly):

open my $fh, '<', "$filename.csv" or die "Can't open $filename.csv: $!";
$dbh->begin_work;
my $sth = $dbh->prepare(
   "INSERT INTO T_$filename (" . join(', ', @column)
   . ") VALUES (" . join(',', map {'?'} @column) . ")"
);

my $c_size = 500; # size of buffer chunk
my @buffer;

while (<$fh>) {
   chomp; # keep the newline out of the last field
   my @row = split /,/, $_, -1; # No quoting issues in these files, yay! (-1 keeps trailing empty fields)
   push @buffer, \@row;
   next unless (@buffer >= $c_size or eof($fh));

   # flush the buffered rows to the database
   while (@buffer) {
      my $row = shift @buffer;
      $sth->execute(@$row);
   }
}

close $fh;
$dbh->commit;

In short, I read chunks of 500 rows and then insert them into the database. It works, but I have an idea for doing it better. The problem is that I don't know where to begin implementing it. Basically, I want to do this:

1. Start reading lines into a buffer
2. Insert lines from the buffer into the database, waiting for more lines when the buffer is empty (until EOF)
3. When the file ends, finish inserting from the buffer and exit
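
For concreteness, the steps above might be sketched with Perl's ithreads and Thread::Queue roughly as follows. This is a minimal sketch, not a benchmarked solution: it assumes a perl built with thread support, and `pump_csv` and `$sink` are hypothetical names invented for illustration. A background thread reads lines into a queue while the calling thread consumes them. All database work stays in the calling thread, because DBI handles are not meant to be shared across threads.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical helper: read $path in a background thread and pass each
# parsed row (an array ref) to $sink, a code ref run in the calling thread.
sub pump_csv {
    my ($path, $sink) = @_;
    my $queue = Thread::Queue->new;

    # Reader thread: touches only the file, never a database handle.
    my $reader = threads->create(sub {
        my $fh;
        unless (open $fh, '<', $path) {
            $queue->enqueue(undef);          # unblock the consumer first
            die "Can't open $path: $!";
        }
        while (my $line = <$fh>) {
            chomp $line;
            $queue->enqueue($line);          # step 1: fill the buffer
        }
        close $fh;
        $queue->enqueue(undef);              # sentinel: end of file
        return;
    });

    # Consumer: dequeue() blocks while the buffer is empty (step 2) and
    # the undef sentinel ends the loop once everything is drained (step 3).
    my $count = 0;
    while (defined(my $line = $queue->dequeue)) {
        $sink->([ split /,/, $line, -1 ]);
        $count++;
    }
    $reader->join;
    return $count;
}
```

With DBI, the sink could be something like `sub { $sth->execute(@{ $_[0] }) }`, called between `begin_work` and `commit`. Note that the queue here is unbounded, so a reader that outpaces the database will hold more and more of the file in memory. Recent Thread::Queue versions document a `limit` setting that may help with that. Whether this beats the simple chunked loop depends on how much time is really spent waiting on the disk, so it is worth measuring.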

I'm thinking threads are the answer, but even after reading the manuals I'm not sure I understand enough about threading to be sure this is the right approach. And, I don't know how to even begin to search CPAN for existing "wheels" that do things this way.

Any ideas or pointers on how to begin, or for approaches that might be even better? I humbly await enlightenment from ye noble Monks.

radiantmatrix
require General::Disclaimer; s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}

In reply to Implementing a buffered read-and-insert algorithm by radiantmatrix
