jjhorner has asked for the wisdom of the Perl Monks concerning the following question:
All,
I've got large amounts of data that I need to put into a database (a few 10M records).
My question is this:
If I connect to the database, loop over each line, and execute the SQL to add that line to the database, the inserts across Unix sockets take a long time. How can I speed up the insertion so that most of it is batched? Is it feasible to build one mass SQL string with multiple inserts at once? Would I gain any speed that way? Example:
my $SQL = "blah;";
$SQL .= "blah;";
$SQL .= "blah;";
Or should I do it normally:
my $SQL = "blah";
<insert DBI code to execute here>
Any advice will be helpful...
J. J. Horner
CISSP,CCNA,CHSS,CHP,blah,blah,blah
Re: Database input speed question
by Abigail-II (Bishop) on Jul 31, 2003 at 15:21 UTC
Many databases support some form of bulk insert, like Sybase's bcp utility (and its corresponding library). That should be the fastest way of doing multiple inserts. Your database might not support bulk inserting into a non-empty table; in that case, insert the data into a temporary table and use a stored procedure to move it from the temporary table into the final table.

Also, if you have indices on the table, drop the indices, do the insert, and create the indices again. Dropping and recreating insert triggers can also be a huge gain, but you have to make sure it's safe to drop the trigger.
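As a rough sketch of that drop/load/recreate pattern, here is the statement list you would run around the bulk load (the table and index names are invented, the `(...)` column list is elided, and the exact DDL varies per DBMS; with a live handle you would run `$dbh->do($_) for @ddl`):

```perl
use strict;
use warnings;

# Build the DDL plan around a bulk load: drop the indices first,
# do the load, then recreate the indices. Only the statement
# strings are built here -- no database is touched.
sub bulk_load_plan {
    my ($table, @indices) = @_;
    my @ddl;
    push @ddl, "DROP INDEX $_" for @indices;             # 1. drop indices
    push @ddl, "-- bulk load into $table here (bcp / LOAD DATA / sqlldr)";
    push @ddl, "CREATE INDEX $_ ON $table (...)" for @indices;  # 2. recreate
    return @ddl;
}

my @plan = bulk_load_plan('log', 'log_idx1', 'log_idx2');
print "$_\n" for @plan;
```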
Abigail
Re: Database input speed question
by liz (Monsignor) on Jul 31, 2003 at 15:24 UTC
If you're using MySQL, there is such a thing as:
LOAD DATA INFILE 'datafile.txt' INTO TABLE tablename
which lets you bulk-import records and fields from a properly formatted text file (which you can easily create with Perl).
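For instance, a sketch of writing such a file from Perl (the filename, fields, and records are invented; tab-separated is LOAD DATA's default delimiter, and the field order must match the table's columns):

```perl
use strict;
use warnings;

# Write records as tab-separated lines, one record per line --
# the default format LOAD DATA INFILE expects.
my @records = ( [ 1, 'vroom',  'god'   ],
                [ 2, 'merlyn', 'saint' ] );

open my $fh, '>', 'monks.txt' or die "open: $!";
for my $rec (@records) {
    print $fh join("\t", @$rec), "\n";
}
close $fh or die "close: $!";
```

Then `LOAD DATA LOCAL INFILE 'monks.txt' INTO TABLE monks` pulls the whole file in at once.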
Liz
Re: Database input speed question
by Corion (Patriarch) on Jul 31, 2003 at 15:24 UTC
The simple answer is: don't use DBI. Every decent database has a specialized "loader" program that bypasses SQL completely and pumps the raw data for each table in from a data file. The loader programs vary in syntax and features; for Oracle, have a look at SQL*Loader (sqlldr), and Sybase and MSSQL have a similar program (bcp).
The catch is that you can't have computed fields in your load, as the loader will most likely not have that capability, but you can always load the database first and then update the fields that need (Perl) computation.
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
The loader may not have the intelligence to do computed fields, but Perl does. Since you will most likely have to write a Perl script anyway to reformat your data into a format the bulkloader accepts, you can take that opportunity to fill in the computed fields.
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Database input speed question
by blue_cowdawg (Monsignor) on Jul 31, 2003 at 15:39 UTC
Short answer #1: it depends.
Short answer #2: TIMTOWTDI!
When considering database performance there is a wide range of issues to weigh, some of which you may not have any control over if you are in a shop that stovepipes responsibility: you deal with a DBA for database layout, a Data Architect for table and index layout, and a System Administrator who does the system work.
Some of the issues are:
- how the data and indexes are laid out across "spindles"
- how the tables are indexed, and how many indexes are involved
- I/O bandwidth of the database server
- I'm sure I am forgetting something...
As far as speeding up the inserts themselves goes, I have found over the years that the bulk import commands supplied by the database vendor are usually faster than inserts done in code. So if you have lots of data to insert into a table all at once, you can write your data from Perl to a file and import it with the bulk import feature of your choice.

An example of where I used this: at a major financial firm I worked for, I gleaned the ARP caches from our network devices with a Perl script that "walked" through the devices, discovering neighboring devices along the way. This generated 20,000 or more Ethernet address/IP address pairs that needed to be stored in a table for later analysis. Normal inserts against an Oracle database took far too long, but a bulk import took seconds.
Peter @ Berghold . Net
Seize the cow! Bite the day!
Test the code? We don't need to test no stinkin' code! All code posted here is as is where is unless otherwise stated.
Brewer of Belgian style Ales
Re: Database input speed question
by dga (Hermit) on Jul 31, 2003 at 17:10 UTC
Another thing that I didn't see mentioned: if you can't use a bulk loader (computed fields, etc.), you can still wrap the inserts in a transaction and commit at the end. If you have to use DBI to insert lots of data this is a big time saving; prepare the statement once outside the loop as well, so the statement preparation happens only one time instead of once per record.
#... script starting stuff up here
my $dbh=DBI->connect("DBI:Pg(RaiseError=>1, AutoCommit=>0):dbname=$database");
my $stmt="INSERT INTO table ( field1, field2 ) VALUES ( ?, ? )";
eval {
$dbh->rollback;
my $sth=$dbh->prepare($stmt);
while(<INPUT>)
{
my($field1, $field2)=split;
$sth->execute($field1, $field2);
}
$dbh->commit;
};
if($@)
{
print STDERR "Data did not insert correctly: $@";
$dbh->rollback;
}
This does two things: the inserts go a lot faster, and they go in as a group or not at all, which makes cleaning up after a failure much easier.
The :Pg loads up a PostgreSQL connection which I used in this example since that's the database I use most.
The RaiseError=>1 turns failures in DBI calls into fatal errors which the eval traps and reports on.
The AutoCommit=>0 tells DBI not to commit records until you call commit();
I was going to suggest something similar, but there are some issues with this as well.
First, one must make sure to still commit at regular intervals. The original poster mentioned 10M rows (I'm guessing 10 million); you'd need a rather large transaction log to batch all of those rows in a single transaction.
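A sketch of that commit-every-N-rows loop, with the database calls stubbed out as code refs so only the batching arithmetic is shown (with a real handle, `$do_insert` would be `$sth->execute(...)` and `$do_commit` would be `$dbh->commit`):

```perl
use strict;
use warnings;

# Insert rows, committing every $batch rows instead of holding
# one giant transaction open; a final commit flushes any
# partial batch left at the end.
sub batched_insert {
    my ($rows, $batch, $do_insert, $do_commit) = @_;
    my $count = 0;
    for my $row (@$rows) {
        $do_insert->($row);
        $do_commit->() if ++$count % $batch == 0;
    }
    $do_commit->() if $count % $batch;   # flush the last partial batch
    return $count;
}

my $commits = 0;
batched_insert([1 .. 2500], 1000, sub {}, sub { $commits++ });
# 2500 rows with a batch size of 1000 -> commits at 1000, 2000, and the end
```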
Second, batching may not gain you all that much. Certainly with Sybase, the total amount of work the database does is the same whether you batch transactions or not; the only real difference is that rows are committed to the "real" table when the server gets the COMMIT TRAN command instead of each row being placed there individually (i.e. the total amount of disk I/O is the same).
Michael
use strict;
use warnings;
use Benchmark qw( cmpthese );
use DBI;
my $dbc=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>0):dbname=...');
my $dba=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>1):dbname=...');
cmpthese ( 10, {
'ac' => sub { &inserts($dba, 'auto', 1000, 0 ) },
'mc' => sub { &inserts($dbc, 'manual', 1000, 1 ) },
});
sub inserts
{
my($dbh, $table, $rows, $commit)=@_;
my $stmt=qq[ INSERT INTO $table ( id, val ) VALUES ( ?, ? ) ];
eval {
$dbh->rollback if($commit);
my $sth=$dbh->prepare($stmt);
foreach my $row ( 1..$rows )
{
$sth->execute( $row, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz/' );
}
$dbh->commit if($commit);
};
if($@)
{
print STDERR "Insert Failed: $stmt: $@";
$dbh->rollback;
die "Program Terminated";
}
return;
}
END { $dbc->disconnect; $dba->disconnect };
Here are some results.
#run 1
Benchmark: timing 10 iterations of ac, mc...
ac: 33 wallclock secs ( 0.33 usr + 0.56 sys = 0.89 CPU) @ 11.24/s (n=10)
mc: 3 wallclock secs ( 0.38 usr + 0.35 sys = 0.73 CPU) @ 13.70/s (n=10)
#run 2
Benchmark: timing 10 iterations of ac, mc...
ac: 37 wallclock secs ( 0.41 usr + 0.81 sys = 1.22 CPU) @ 8.20/s (n=10)
mc: 4 wallclock secs ( 0.37 usr + 0.50 sys = 0.87 CPU) @ 11.49/s (n=10)
#run 3
Benchmark: timing 10 iterations of ac, mc...
ac: 38 wallclock secs ( 0.48 usr + 0.60 sys = 1.08 CPU) @ 9.26/s (n=10)
mc: 4 wallclock secs ( 0.38 usr + 0.40 sys = 0.78 CPU) @ 12.82/s (n=10)
Note that I am comparing wall clock time, since the Perl code itself has very little to do. I did three runs to get a representative sample. This is running against PostgreSQL as the backend on the local host, so there is minimal communication overhead.

Committing after every 1000 rows in this test consistently yields a 10-fold speedup over using AutoCommit. As usual YMMV, and results will certainly vary with a different database engine. Also note that the bulk data importer, fed a text file containing the same data, takes less than 1 second to complete, while running the 10000 inserts with a single commit takes about 3 seconds.

The data set in this test is only 663k. I estimate that a significant portion of the time difference comes from the fact that when commit returns, the database pledges that the data has been written to durable media. For the manual commits this happens 10 times, whereas with AutoCommit it occurs 10000 times. If that were the only variable, manual commit would be 1000 times faster instead of 10 times, so the actual writing of the data constitutes a big portion of the time, and that, as mentioned, is the same for any approach.
Re: Database input speed question
by hardburn (Abbot) on Jul 31, 2003 at 15:26 UTC
How often are you going to be inserting all this data? Is it a one-time thing? Monthly? Weekly? I doubt optimizing for speed will matter unless you need to insert that much data once an hour or so.
Good ol' placeholders with multiple execute statements should do fine:
my @DATA = (
[ qw( foo bar baz ) ],
[ qw( foo1 bar1 baz1 ) ],
. . .
);
my $sql = q/
INSERT INTO table
(
col1,
col2,
col3
)
VALUES ( ?, ?, ?)
/;
# $dbh is an existing DBI connection
my $sth = $dbh->prepare($sql)
or die . . . ;
foreach my $row (@DATA) {
$sth->execute( @{$row} );
}
$sth->finish;
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
Note: All code is untested, unless otherwise stated
Re: Database input speed question
by DigitalKitty (Parson) on Jul 31, 2003 at 17:51 UTC
Hi jjhorner.
Good question. If you are running MySQL, you have two primary tools. From the mysql prompt enter:
mysql> LOAD DATA LOCAL INFILE "filename.txt" INTO TABLE tablename
Since SQL keywords aren't case sensitive, you need not use the same uppercase style as I did. Example:
mysql> LOAD DATA LOCAL INFILE "monks.txt" INTO TABLE monks
The other tool is called 'mysqlimport', and it acts as a command line interface to LOAD DATA. It derives the table name from the data file's name (minus the extension), so there must be no 'extra' characters before the file extension; otherwise mysqlimport will try to insert the data into the wrong table. It works like this:
(system prompt) mysqlimport --local database_name datafile.txt
Hope this helps,
-Katie
Re: Database input speed question
by chunlou (Curate) on Jul 31, 2003 at 17:12 UTC
In addition to what's been said, the following kind of statement could also be considered, if it makes sense in your program (multiple rows in one INSERT statement):
INSERT INTO
table
VALUES
(val11,val12), (val21,val22), (val31,val32)
But the DB's native batch facility is still the best bet, especially if the insert doesn't have to be real time.
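If you build such a multi-row statement from DBI, placeholders keep the quoting safe. A sketch that assembles the SQL and the flattened bind list (the table and column names are invented; the result is ready for `$dbh->do($sql, undef, @binds)`):

```perl
use strict;
use warnings;

# Build one INSERT with a (?, ?, ...) group per row, plus the
# flattened list of bind values for all rows.
sub multi_row_insert {
    my ($table, $cols, @rows) = @_;
    my $group = '(' . join(', ', ('?') x @$cols) . ')';
    my $sql   = sprintf 'INSERT INTO %s (%s) VALUES %s',
                        $table,
                        join(', ', @$cols),
                        join(', ', ($group) x @rows);
    my @binds = map { @$_ } @rows;
    return ($sql, @binds);
}

my ($sql, @binds) = multi_row_insert('monks', [qw(id name)],
                                     [1, 'vroom'], [2, 'merlyn']);
```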
I must point out that that syntax doesn't work in every DBMS. For PostgreSQL, you'll need something a bit more complex:
INSERT INTO
table
SELECT val11, val12
UNION
SELECT val21, val22
UNION
SELECT val31, val32
etc.
I don't think this will be significantly faster for large collections of data than to just use DBI with a statement handle and placeholders (but YMMV).
--
Jeff Boes, Database Engineer, Nexcerpt, Inc.
...Nexcerpt...Connecting People With Expertise
Thanks for the clarification.
Re: Database input speed question
by cleverett (Friar) on Jul 31, 2003 at 18:35 UTC
There are valid reasons to do what you want to do from DBI, especially when you're talking tens of millions of records. I've had to myself.
If you happen to be using MySQL you can indeed do multiple inserts in one DBI operation, and it does dramatically speed things up (I get 11K inserts/sec on a lightly loaded mysql-max 3.23.49 using InnoDB on an Athlon 1800+ server with a 60GB seagate drive, transferring a 2KB wide table):
INSERT INTO log
(banned_account, site_account, banner_id, display_date, display_time, click_date, click_time, cookie_id)
VALUES
(1, 110, 12, '2002-12-31', '06:32:04', '0000-00-00', '00:00:00', 1),
(102, 78, 234, '2002-12-31', '06:32:07', '2003-12-31', '06:32:23', 2),
(24, 154, 123, '2002-12-31', '06:32:08', '0000-00-00', '00:00:00', 3)
Re: Database input speed question
by Mur (Pilgrim) on Jul 31, 2003 at 17:40 UTC
If you are experiencing slowdowns on batch insert, one cross-DBMS solution is to drop the indexes on the table, do the inserts, and re-create the indexes. This almost always speeds things up, at the expense of giving up uniqueness (primary key) checks in some DBMS.
--
Jeff Boes, Database Engineer, Nexcerpt, Inc.
...Nexcerpt...Connecting People With Expertise
Re: Database input speed question
by jjhorner (Hermit) on Jul 31, 2003 at 19:34 UTC
This issue has plagued me all day, but I've come up with a solution that I like:
I use a Perl script to sort the data, open(SQL, "|psql $database"), print "BEGIN;\n" to the pipe, and then print each generated SQL statement directly to it. Whenever $count % 1000 == 0, I print "COMMIT;\nBEGIN;\n".
I'm getting about 2k entries/second vs the 129 entries/second using the DBI method.
I admit that it isn't pretty, but it gets the job done and quickly. I have a year's worth of data (~600k entries/day) to dump into my WFA web activity database soon. This should make it go quicker.
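Roughly, the stream going down the psql pipe looks like this; here it is captured in a string instead of a pipe, with invented INSERT statements, so the COMMIT/BEGIN cadence is visible:

```perl
use strict;
use warnings;

# Emit BEGIN, the statements, COMMIT;BEGIN every $batch rows,
# and a final COMMIT -- the same text jjhorner pipes into psql,
# built here as one string.
sub sql_stream {
    my ($batch, @stmts) = @_;
    my $out   = "BEGIN;\n";
    my $count = 0;
    for my $stmt (@stmts) {
        $out .= "$stmt\n";
        $out .= "COMMIT;\nBEGIN;\n" if ++$count % $batch == 0;
    }
    $out .= "COMMIT;\n";
    return $out;
}

my $stream = sql_stream(2, map { "INSERT INTO hits VALUES ($_);" } 1 .. 3);
print $stream;
```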
J. J. Horner
CISSP,CCNA,CHSS,CHP,blah,blah,blah
Re: Database input speed question
by JSchmitz (Canon) on Jul 31, 2003 at 19:47 UTC
I would not use LOAD DATA INFILE; it is kind of a drag compared to the much slicker mysqlimport, which should be located in /usr/local/mysql/bin. Check that out; you won't be disappointed.
cheers
Jeffery
Re: Database input speed question
by deliria (Chaplain) on Jul 31, 2003 at 19:43 UTC
If you are using MySQL and need to do some processing which can't be handled by a bulk importer, you could use
INSERT DELAYED INTO table (columns) VALUES ( 'values' )
This causes the client to abandon the query once it's sent to the server, thereby speeding things up. If you don't need any return values, for example an auto-increment number, this might be the way to go.