comment on

I'm sorry. I did see you're explaination of your set-up, but mis-read it. There is a chance that you might see some benefit from use threads to overlap the splitting with IO waits on other threads.

I think that the bottleneck will simply move to the communications between your threads and the DB threads, but it is worth a try. I also think that this will be way less efficient than letting the DB do the bulk loading.

And, if as your sample code shows, you need to use transactions, this will almost certainly will be slower, and as you will need to use multiple DB handles for this to work, the benefit of transactions would be lost, as you are spreading the transactions across multiple streams.

It's even possible that this might slow things down, as your threads will be competing with the DB threads for the cpu. Your figure of 20% utilisation suggests that this should not be the case.

Notionally, you would think that reading and splitting the csv on one thread and then sharing the splt data to one or more workers to insert into the DB would be the way to go. The problem with that is that sharing large volumes of data, especially structured data, requires a lot of duplication and locking. My experiences with this are that it slows things down rather than speeds them up.

So, the idea behind the code below is that by duplicating the open csv filehandle in each of the threads, each thread reads the next available line from the file, splits and then inserts it. Whilst any one thread is in wait states on fileIO or DBIO, there shoudl be another thread ready to utilise the cpu to do the splitting. That's the theory anyway. The measure of success will be a) whether your cpu utilisation increases; b) whether the job gets done more quickly. I do not have a multi volume machine to try out this theory.

I also have some doubts about the safety of sharing an open filehandle this way. On my, single cpu machine this appears to work just fine, but your mileage may vary.

If it appears to work and benefit you, start with 2 or 3 threads and increase the number by 1 until you see no greater benefit. I doubt that more than 5 will be beneficial. If that.

Caveat implimentor: This is untested, and for the purpose of discussion and experimentation only. Test this thoroughly, to your own satisfaction before using it on valuable data.

#! perl -slw
use strict;
use threads;
use threads::shared;
use DBI;

##### Warning: entirely untested code ##############

our $THREADS ||= 5;

## A shared 'semaphore'
## This works fine in single cpu boxes
## but may need to be replaced with a 'proper' semaphore 
## on multi-cpu machines
my $running : shared = 0; 

sub worker {
    $running++;
    
    ## get a new handle to the DB in each thread
    my $dbh = DBI->connect(...);
    my $sth = $dbh->prepare(
        "INSERT INTO T_$filename ("
        . join( ', ', @column )
        . ") VALUES ("
        . join( ',', ( '?' ) x @column )
        .")"
    );

    ## get a handle to the CSV file for this thread
    my $fileno = shift;
    open my $fh, "<&=$fileno" or die $!;

    ## process the data
    ## batching would be of little benefit 
    ## if you're going to read a batch
    ## and then execute them individually
    while( <$fh> ) {
        $sth->execute( split ',', $_ );
    }
    $dbh->disconnect;
    $running--;
}

my $fh;
## Open the file in root 
open $fh, 'file.csv' or die $!;

## and then 'share' it with the worker threads by passing the fileno
threads->create( \&worker, fileno( $fh ) )->detach for 1 .. $THREADS;

sleep 1 while $running < $THREADS; ## Let'em start
sleep 1 while $running; ## Wait for'em to finish
[download]

Examine what is said, not who speaks. The end of an era!

"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail "Time is a poor substitute for thought"--theorbtwo "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^3: Implementing a buffered read-and-insert algorithm by BrowserUk
in thread Implementing a buffered read-and-insert algorithm by radiantmatrix

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.