in reply to Re: Implementing a buffered read-and-insert algorithm
in thread Implementing a buffered read-and-insert algorithm

> with your source file and destination files (DB) on the same drive, your limitations are likely to be disk head rather than anything else

Yes, yes, but they aren't; same server does not mean same drive. Unless I had no other choice, I would never put a large DB on the same drive as the OS and application files. In fact, I've already explained my setup for this application.

I thought of threads for this because I have plenty of processor headroom (I'm only hitting about 20%) and because they would specifically take advantage of being able to read from one device interface while writing to another.

radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}


Re^3: Implementing a buffered read-and-insert algorithm
by BrowserUk (Patriarch) on Dec 14, 2004 at 19:35 UTC

    I'm sorry. I did see your explanation of your set-up, but misread it. There is a chance that you might see some benefit from using threads to overlap the splitting with the IO waits on other threads.

    I think that the bottleneck will simply move to the communications between your threads and the DB threads, but it is worth a try. I also think that this will be way less efficient than letting the DB do the bulk loading.
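
    For comparison, this is the sort of thing I mean by letting the DB do the bulk loading. A minimal sketch, assuming MySQL and a hypothetical table T_data; other engines have their own equivalents (PostgreSQL's COPY, for instance):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    ## Hypothetical DSN and table -- substitute your own. The
    ## mysql_local_infile flag is needed for LOCAL INFILE with DBD::mysql.
    my $dbh = DBI->connect(
        'dbi:mysql:database=test;mysql_local_infile=1',
        'user', 'password', { RaiseError => 1 },
    );

    ## One server-side pass: the engine reads, parses and inserts the CSV
    $dbh->do( q{
        LOAD DATA LOCAL INFILE 'file.csv'
        INTO TABLE T_data
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'
    } );

    $dbh->disconnect;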

    And if, as your sample code shows, you need to use transactions, this will almost certainly be slower. You will need multiple DB handles for this to work, and the benefit of transactions would be lost, because you would be spreading each transaction across multiple streams.
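
    If you do go the multiple-handle route, about the best you can salvage is batching commits within each thread. A hedged sketch, assuming a handle opened with AutoCommit => 0 and the $sth and $fh from the code further down; note that rows from other threads still land in other transactions:

    ## Hypothetical helper: commit every $batch rows on one thread's handle.
    ## This only reduces per-row commit overhead; atomicity across the
    ## whole file is still lost once the work is spread over threads.
    sub insert_batched {
        my( $dbh, $sth, $fh, $batch ) = @_;
        my $count = 0;
        while( my $line = <$fh> ) {
            chomp $line;
            $sth->execute( split ',', $line );
            $dbh->commit unless ++$count % $batch;
        }
        $dbh->commit;   ## flush the final partial batch
    }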

    It's even possible that this might slow things down, as your threads will be competing with the DB threads for the CPU. Your figure of 20% utilisation suggests that this should not be the case.

    Notionally, you would think that reading and splitting the CSV on one thread, then passing the split data to one or more workers to insert into the DB, would be the way to go. The problem with that is that sharing large volumes of data between threads, especially structured data, requires a lot of duplication and locking. My experience is that it slows things down rather than speeding them up.
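
    For reference, the queue version I'm advising against would look something like the sketch below (names hypothetical). Every line pushed through Thread::Queue is copied into shared storage, which is exactly where the duplication and locking cost comes in:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $Q = Thread::Queue->new;

    ## Workers block on the queue until the boss enqueues undef
    my @workers = map {
        threads->create( sub {
            while( defined( my $line = $Q->dequeue ) ) {
                my @fields = split ',', $line;   ## split in the worker
                ## ... insert @fields via a per-thread DBI handle ...
            }
        } );
    } 1 .. 3;

    ## Boss: every enqueued line is duplicated into shared storage
    open my $fh, '<', 'file.csv' or die $!;
    while( my $line = <$fh> ) {
        chomp $line;
        $Q->enqueue( $line );
    }
    $Q->enqueue( undef ) for @workers;   ## one terminator per worker
    $_->join for @workers;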

    So, the idea behind the code below is that, by duplicating the open CSV filehandle in each of the threads, each thread reads the next available line from the file, splits it, and then inserts it. Whilst any one thread is in a wait state on file I/O or DB I/O, there should be another thread ready to utilise the CPU to do the splitting. That's the theory, anyway. The measure of success will be a) whether your CPU utilisation increases; b) whether the job gets done more quickly. I do not have a multi-volume machine to try out this theory on.

    I also have some doubts about the safety of sharing an open filehandle this way. On my single-CPU machine it appears to work just fine, but your mileage may vary.
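
    If the shared handle worries you, one belt-and-braces option (my assumption, not something I have benchmarked) is to serialise the reads through a shared lock variable, so only one thread is ever inside readline at a time:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $read_lock : shared = 0;   ## shared variable used only as a lock

    ## Hypothetical helper: only one thread at a time may read from the
    ## duplicated handle; the lock is released when the sub returns
    sub next_line {
        my $fh = shift;
        lock $read_lock;
        return scalar <$fh>;
    }

    Each worker would then loop with while( defined( my $line = next_line( $fh ) ) ) { ... } instead of reading the handle directly.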

    If it appears to work and benefit you, start with 2 or 3 threads and increase the number by 1 until you see no greater benefit. I doubt that more than 5 will be beneficial. If that.

    Caveat implementor: this is untested, and for the purposes of discussion and experimentation only. Test it thoroughly, to your own satisfaction, before using it on valuable data.

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use DBI;

    ##### Warning: entirely untested code ##############

    our $THREADS ||= 5;

    ## Table-name suffix and column list -- these were implicit above;
    ## set them to match your CSV and table layout
    our $filename = 'data';
    our @column   = qw( col1 col2 col3 );

    ## A shared 'semaphore'
    ## This works fine on single-cpu boxes
    ## but may need to be replaced with a 'proper' semaphore
    ## on multi-cpu machines
    my $running : shared = 0;

    sub worker {
        $running++;

        ## get a new handle to the DB in each thread
        my $dbh = DBI->connect( ... );   ## your DSN/user/password here
        my $sth = $dbh->prepare(
            "INSERT INTO T_$filename ("
            . join( ', ', @column )
            . ") VALUES ("
            . join( ',', ( '?' ) x @column )
            . ")"
        );

        ## get a handle to the CSV file for this thread
        ## by dup'ing the inherited file descriptor
        my $fileno = shift;
        open my $fh, '<&=', $fileno or die $!;

        ## process the data
        ## batching would be of little benefit
        ## if you're going to read a batch
        ## and then execute them individually
        while( <$fh> ) {
            chomp;
            $sth->execute( split ',', $_ );
        }
        $dbh->disconnect;
        $running--;
    }

    ## Open the file in the main thread
    open my $fh, '<', 'file.csv' or die $!;

    ## and then 'share' it with the worker threads by passing the fileno
    threads->create( \&worker, fileno( $fh ) )->detach for 1 .. $THREADS;

    sleep 1 while $running < $THREADS;   ## Let 'em start
    sleep 1 while $running;              ## Wait for 'em to finish
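
    Because of the -s on the shebang line, the thread count can be set from the command line rather than by editing the script; calling the script loadcsv.pl purely for example's sake:

        loadcsv.pl -THREADS=3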

    Examine what is said, not who speaks.        The end of an era!
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon