Re: Implementing a buffered read-and-insert algorithm
by punch_card_don (Curate) on Dec 13, 2004 at 21:41 UTC
I went through the same learning curve recently.
Lesson learned the hard way: Do not do multiple inserts unless you have all day to wait...
Use "LOAD DATA INFILE". This is an SQL facility specifically for importing CSV data into tables, and it's very, very fast. Example? My line-wise insert code, which strangely resembled yours, was working at ~300 file lines per hour with a complex line-analysis intermediate step on a ~1.5M-line file (650 MB). Once I had done the intermediate line-wise analysis with a separate program, so the data was pre-formatted and ready for import, LOAD DATA INFILE took about half an hour to import it all.
In your case this is going to be childishly simple, since it appears you have only one table to insert into:
$sql = "LOAD DATA INFILE '$filename.csv' INTO TABLE T_$filename";
There are options available for field terminators and line terminators. Check out
http://dev.mysql.com/doc/mysql/en/LOAD_DATA.html
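For instance, here is a minimal DBI sketch of that statement, assuming comma-separated fields and Unix line endings; the DSN, the credentials, and the $filename value are placeholders to fill in:
use DBI;

my $filename = 'mydata';    # placeholder
my $dbh = DBI->connect( 'dbi:mysql:mydb', 'user', 'password',
                        { RaiseError => 1 } );
$dbh->do( qq{
    LOAD DATA INFILE '$filename.csv'
    INTO TABLE T_$filename
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
} );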
Forget that fear of gravity,
Get a little savagery in your life.
I'm chugging through 1.6M lines in about 53 minutes. Also, most of my reasons for not using a built-in file loader have to do with portability and expandability. See this reply for more detail on the matter.
And, of course, part of it is a desire to learn more about neat things I can do with Perl. ;-)
radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}
Re: Implementing a buffered read-and-insert algorithm
by kvale (Monsignor) on Dec 13, 2004 at 21:30 UTC
Often threading really helps an application's speed and responsiveness, but I don't think it will help speed up your application here. The reason is that both your CSV files and your database are on the same disk drive, but at different disk locations. So alternating between threads will only cause disk thrashing as the disk head continually seeks between the CSV and database locations. On the other hand, reading sequential rows of a CSV file from disk is a relatively fast operation compared to seeks, so that is the better algorithm here.
Along the same lines, if you can, read as much of the CSV file into memory as you can each time. That maximizes sequential reads vs. seeks and should speed up disk I/O on the CSV side. You may also want to see if your chosen database has a "loader" program that is purpose-built to quickly import large amounts of data into a database.
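Something like the following rough sketch, say -- the 10_000-line chunk size is arbitrary, $csv_fh is an already-open handle, and insert_rows() is a hypothetical stand-in for whatever routine does the actual DBI work:
my @buffer;
while ( my $line = <$csv_fh> ) {
    push @buffer, $line;
    if ( @buffer >= 10_000 ) {      # tune to whatever memory allows
        insert_rows( \@buffer );    # hypothetical: does the DBI inserts
        @buffer = ();
    }
}
insert_rows( \@buffer ) if @buffer; # flush the final partial chunk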
> The reason is that both your CSV files and you database are on the same disk drive, but at different disk locations
Nope. The CSV files and the OS are on one volume, and the database is on a separate RAID volume set.
> You may also want to see if your chosen database has a "loader" program that is purpose built to quickly import large amounts of data into a database.
That would seem to make the most sense on its face. However, this tool will be useful for a number of DBs, and there is a good chance that there will be pattern matching later as a sort of "pre-filter", and the occasional manipulation of data before insertion.
radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}
> this tool will be useful for a number of DBs, and there is a good chance that there will be pattern matching later as a sort of "pre-filter", and the occasional manipulation of data before insertion.
Even so, it is a common feature of all the major DB servers that each has an "import-from-data-file" tool or function, which is normally quite easy to apply and is highly optimized for loading data into that server. This will always beat the pants off doing inserts with Perl/DBI on really large data sets, even when you throw in the time it takes to read the man page for the DB's import tool (which shouldn't take more than 15 minutes).
As for any pre-filtering or editing of the data, setting up a Perl script to do just that step will generally be a better solution (cleaner, quicker, easier to debug/validate/update) than having that kind of logic in the same app as the relatively complicated I/O methods you're looking at (along with the DBI stuff too). In other words, given a huge CSV stream that isn't exactly what you'd want to feed into the DB, write a script that just creates the exact CSV stream that you do want to import. Then use the DB server's native import tool to load the filtered/edited data file.
If you've got plenty of time and CPU cycles to burn, go ahead and experiment with trying to approximate what DB-specific loaders already do, but if you're in a hurry and/or have a lot of different instances (data sets and/or DB servers) to worry about, keep it simple by using the existing tools for the job, and only invent the ones that don't exist yet.
update: In case it helps, bear in mind that it would be both easy and efficient to develop a Perl wrapper for each of the DB-native import tools. This might not seem worthwhile until you have at least two different DB servers to work with, but once you see what the similarities and differences are among the DB loader tools, I think you could get a lot of mileage and satisfaction out of "objectifying" and "modularizing" that aspect of the problem -- Perl could provide a handy layer of abstraction to hide the differences among DB loaders, and your development time (and run time) would be much better spent in that direction.
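To illustrate the idea, here is a rough, untested sketch of such a wrapper -- the DB::BulkLoader name, the constructor arguments, and the dispatch on driver name are all invented for this example, and the connection/login flags of the real tools are mostly omitted:
package DB::BulkLoader;    # hypothetical module name
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;   # expects at least driver and database; user/password as needed
    return bless { %args }, $class;
}

sub load_csv {
    my ( $self, $file, $table ) = @_;
    if ( $self->{driver} eq 'mysql' ) {
        ## mysqlimport is the command-line front end to LOAD DATA INFILE;
        ## note that it derives the table name from the file's basename.
        system( 'mysqlimport', '--fields-terminated-by=,',
                $self->{database}, $file ) == 0
            or die "mysqlimport failed: $?";
    }
    elsif ( $self->{driver} eq 'sybase' ) {
        ## bcp is the Sybase/SQL Server bulk copy utility.
        system( 'bcp', "$self->{database}..$table", 'in', $file,
                '-c', '-t,', '-U', $self->{user}, '-P', $self->{password} ) == 0
            or die "bcp failed: $?";
    }
    else {
        die "no bulk loader configured for driver '$self->{driver}'";
    }
    return 1;
}

1;
A caller would then just say something like DB::BulkLoader->new( driver => 'mysql', database => 'mydb' )->load_csv( 'data.csv', 'T_data' ), and all the per-server messiness stays hidden in one place.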
Re: Implementing a buffered read-and-insert algorithm
by VSarkiss (Monsignor) on Dec 13, 2004 at 22:47 UTC
If database insert speed is the bottleneck, and you have clean data, one of the best ways to gain speed is to drop indexes and constraints on the table before you insert, then re-create them afterwards.
This presumes, of course, that you have permission to execute DDL in the database, that you know the table schema, that you won't hose up your data by not having constraints, and so on.
Also, some databases, like Sybase and SQL Server, have specialized bulk-insert interfaces that will allow you to load data irrespective of constraints. But that may not work for you, since you seem to indicate you're going against multiple platforms at different times.
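A rough sketch of that drop-and-recreate pattern, for the sake of illustration -- the index, table, and column names are invented, and the exact DDL varies from one database to another:
$dbh->do( 'DROP INDEX idx_example_name' );   # on MySQL: ALTER TABLE T_example DROP INDEX idx_example_name
## ... run the bulk of the inserts here ...
$dbh->do( 'CREATE INDEX idx_example_name ON T_example (name)' );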
Re: Implementing a buffered read-and-insert algorithm
by eric256 (Parson) on Dec 13, 2004 at 21:40 UTC
I would check whether your database supports delayed inserts. That would allow you to do your buffered read and then a sort of buffered insert. I'm not sure what that will do as far as disk reads and writes, but I've had great success with this strategy in database-to-database transfers where the insert time was holding me up. (By great success I mean going from 60+ seconds per table to 5-10 seconds per table.)
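For reference, the MySQL form looks something like this -- $dbh is an already-connected handle, and the table and column names are placeholders:
my $sth = $dbh->prepare(
    'INSERT DELAYED INTO T_example (col1, col2) VALUES (?, ?)'
);
$sth->execute( $val1, $val2 );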
According to the MySQL manual, INSERT DELAYED gets its benefit from allowing chunks of data from many different clients to be inserted as one block. So, even if I were using MySQL, this particular application would not benefit from it.
By reading chunks of data and then inserting them in rapid sequence, I have managed to ask the database to write in larger blocks -- and that has increased performance. But, you solved another bugger of a problem I've been having, saving me another SoPW post. Thank you! ;-)
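For the curious, the chunked pattern I'm describing amounts to something like this sketch -- $dbh, $sth, and $csv_fh are assumed to be set up already, and the chunk size is illustrative:
$dbh->{AutoCommit} = 0;            # commit in chunks rather than per row
my $chunk = 500;
my $count = 0;
while ( my $line = <$csv_fh> ) {
    chomp $line;
    $sth->execute( split /,/, $line );
    $dbh->commit unless ++$count % $chunk;
}
$dbh->commit;                      # flush the final partial chunk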
radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}
Glad to have helped in some way. The previous paragraph in the documentation shows the advantage in your situation. Like I said before, with them both on the same machine this might not yield a great amount of benefit, but when transferring between machines it's great.
> When a client uses INSERT DELAYED, it gets an okay from the server at once, and the row is queued to be inserted when the table is not in use by any other thread.
> Another major benefit of using INSERT DELAYED is that inserts from many clients are bundled together and written in one block. This is much faster than doing many separate inserts.
Re: Implementing a buffered read-and-insert algorithm
by BrowserUk (Patriarch) on Dec 13, 2004 at 22:10 UTC
> I'm thinking threads are the answer...
Unless your box has multiple CPUs, threads are unlikely to help much. Even if it does (have multiple CPUs), with your source file and destination files (DB) on the same drive, your limitation is likely to be the disk head rather than anything else. Using multiple threads, even on different CPUs, is not going to benefit you if they are all constantly waiting on IO because the seek head is switching back and forth between the source and destination.
What sort of CPU usage are you showing whilst your script is running?
Do a super search for LOAD DATA INFILE, and put tachyon as the author. He has done lots of this stuff (on MySQL) and has worked out the best strategies. Look also for his advice about dropping indexes for the duration of the load and restoring them afterwards.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
> with your source file and destination files (DB) on the same drive, your limitations are likely to be disk head rather than anything else
Yes, yes, but they aren't; same server does not mean same drive. Unless I had no other choice, I would never put a large DB on the same drive as the OS and application files. In fact, I've already explained my setup for this application.
I thought of threads for this because I have plenty of processor remaining (I'm only hitting about 20%) and because it would specifically take advantage of being able to read from one device interface while writing to another.
radiantmatrix
require General::Disclaimer;
s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}
I'm sorry. I did see your explanation of your set-up, but misread it. There is a chance that you might see some benefit from using threads to overlap the splitting with I/O waits on other threads.
I think that the bottleneck will simply move to the communications between your threads and the DB threads, but it is worth a try. I also think that this will be way less efficient than letting the DB do the bulk loading.
And if, as your sample code shows, you need to use transactions, this will almost certainly be slower; and since you will need multiple DB handles for this to work, the benefit of transactions would be lost, as you are spreading the transactions across multiple streams.
It's even possible that this might slow things down, as your threads will be competing with the DB threads for the CPU. Your figure of 20% utilisation suggests that this should not be the case.
Notionally, you would think that reading and splitting the CSV on one thread and then sharing the split data with one or more workers to insert into the DB would be the way to go. The problem with that is that sharing large volumes of data, especially structured data, requires a lot of duplication and locking. My experience with this is that it slows things down rather than speeding them up.
So, the idea behind the code below is that by duplicating the open CSV filehandle in each of the threads, each thread reads the next available line from the file, splits it, and then inserts it. Whilst any one thread is in a wait state on file I/O or DB I/O, there should be another thread ready to use the CPU to do the splitting. That's the theory, anyway. The measures of success will be a) whether your CPU utilisation increases; b) whether the job gets done more quickly. I do not have a multi-volume machine on which to try out this theory.
I also have some doubts about the safety of sharing an open filehandle this way. On my single-CPU machine this appears to work just fine, but your mileage may vary.
If it appears to work and benefit you, start with 2 or 3 threads and increase the number by 1 until you see no greater benefit. I doubt that more than 5 will be beneficial. If that.
Caveat implementor: This is untested, and for the purposes of discussion and experimentation only. Test it thoroughly, to your own satisfaction, before using it on valuable data.
#! perl -slw
use strict;

use threads;
use threads::shared;
use DBI;

##### Warning: entirely untested code ##############

our $THREADS ||= 5;

## $filename and @column are assumed to be set up
## exactly as in the original post's code
our ( $filename, @column );

## A shared 'semaphore'
## This works fine on single cpu boxes
## but may need to be replaced with a 'proper' semaphore
## on multi-cpu machines
my $running : shared = 0;

sub worker {
    $running++;

    ## get a new handle to the DB in each thread
    my $dbh = DBI->connect(...);
    my $sth = $dbh->prepare(
        "INSERT INTO T_$filename ("
        . join( ', ', @column )
        . ") VALUES ("
        . join( ',', ( '?' ) x @column )
        . ")"
    );

    ## get a handle to the CSV file for this thread
    my $fileno = shift;
    open my $fh, "<&=$fileno" or die $!;

    ## process the data
    ## batching would be of little benefit
    ## if you're going to read a batch
    ## and then execute them individually
    while( <$fh> ) {
        chomp;                              ## strip the newline off the last field
        $sth->execute( split ',', $_ );
    }

    $dbh->disconnect;
    $running--;
}

## Open the file in the root (main) thread
my $fh;
open $fh, 'file.csv' or die $!;

## and then 'share' it with the worker threads by passing the fileno
threads->create( \&worker, fileno( $fh ) )->detach for 1 .. $THREADS;

sleep 1 while $running < $THREADS;  ## Let'em start
sleep 1 while $running;             ## Wait for'em to finish
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Implementing a buffered read-and-insert algorithm
by Roy Johnson (Monsignor) on Dec 13, 2004 at 21:29 UTC
It does sound like what you want is threads, where one thread is reading, and the other is loading, and they're sharing a buffer. Yikes.
It's not clear to me why you consider your buffered loader to be better than the more straightforward
while (<FH>) {
    $sth->execute(split(','));
}
Caution: Contents may have been coded under pressure.
Re: Implementing a buffered read-and-insert algorithm
by Thilosophy (Curate) on Dec 14, 2004 at 00:57 UTC
Do not use Perl/DBI/SQL for this. Use the bulk loading utility that comes with your database.
Re: Implementing a buffered read-and-insert algorithm
by Roy Johnson (Monsignor) on Dec 13, 2004 at 22:17 UTC
Here's a little example that avoids the overhead of manipulating @buffer and @rows. (I wrote it as an example program, without the database stuff.)
open(FH, 'data.txt') or die "Can't open data.txt: $!";
my $c_size = 500;
while (!eof FH) {
    for (map { (eof FH) ? () : [ split ',', <FH> ] } 1..$c_size) {
        print "Insert @$_\n";
    }
    print "Buffer done\n";
}
Caution: Contents may have been coded under pressure.
Re: Implementing a buffered read-and-insert algorithm
by mkirank (Chaplain) on Dec 14, 2004 at 05:30 UTC
The wise monks have already replied to your question:
1. Bulk insert is the fastest way to insert large amounts of data.
2. Remove indices before inserting, and re-create them afterwards.
3. Perl DBI does not support bulk insert, so you have to use the facility provided by the database (this is different for each database).
Also see if DBD::SQLite can be of help; according to its documentation it is faster than MySQL.
> DBI does not support Bulk Insert
True, though DBD::Sybase (in its development version, and in 1.05 once it is released) supports experimental access to Sybase's BLK API, and this can speed up inserts by a tremendous amount.
Michael
Yes, I remember reading on the DBI list that you were working on implementing this. Any idea whether the bulk insert option will be done for the other databases as well?
Re: Implementing a buffered read-and-insert algorithm
by talexb (Chancellor) on Dec 14, 2004 at 05:03 UTC
Strongly suggest either LOAD DATA INFILE or some kind of bcp (Bulk Copy) program. Doing a zillion INSERTs is the slowest way to do this operation.
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Re: Implementing a buffered read-and-insert algorithm
by RiotTown (Scribe) on Dec 14, 2004 at 20:14 UTC
You mention that portability is a large concern, so you want to avoid DB-specific bulk loaders. Writing a script that generates the various DB-specific loader files wouldn't be that difficult, and would probably be a whole lot more beneficial. I find many uses for the bulk loading abilities of Oracle (and have also found that few people really know the true power and help they can bring to the table). The performance gains you get with these utilities are HUGE: we are not talking just 10-20%, but in many cases hundreds to thousands of times faster than normal inserts.
Also, if you are going to stay with the standard insertion method:
Threads are mentioned, but if there are any constraints set up on the table then you'll lose any performance increase on the inserts, as the DB has to check constraints on two sets of insertions rather than a single stream of them.
The 500 rows between commits is also something you can tweak (either larger or smaller) to see additional performance gains, depending on the performance and memory of your current DB.