http://qs1969.pair.com?node_id=279606

jjhorner has asked for the wisdom of the Perl Monks concerning the following question:

All,

I've got large amounts of data that I need to put into a database (a few 10M records).

My question is this:

If I connect to the database, cycle through each line, and execute the SQL to add the line to the database, it takes a while to insert the data across Unix sockets. What ways can I speed up the insertion so that I can batch most of it? Is it conceivable that I could build one mass SQL string with multiple inserts at once? Will I gain any speed that way? Example:

my $SQL = "blah\;"; $SQL .= "blah\;"; $SQL .= "blah\;";

Or should I do it normally:

my $SQL = "blah"; <insert DBI code to execute here>
Any advice will be helpful...
J. J. Horner 
CISSP,CCNA,CHSS,CHP,blah,blah,blah

Replies are listed 'Best First'.
Re: Database input speed question
by Abigail-II (Bishop) on Jul 31, 2003 at 15:21 UTC
    Many databases support some form of bulk insert, like Sybase's bcp utility (and corresponding library). That should be the fastest way of doing multiple inserts. Your database might not support doing a bulk insert into a non-empty table - in that case, insert the stuff into a temporary table, and use a stored procedure to move the stuff from the temporary table into the final table.

    Also, if you have indices on the table, drop the indices, do the insert, and create the indices again. Dropping and recreating insert triggers could also be a huge gain, but you have to make sure it's safe to drop the trigger.
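
    A rough sketch of that drop/load/recreate pattern from DBI, with made-up table, index and connection names (the bulk load step itself would be whatever loader your database provides):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical Sybase connection; adjust DSN and credentials for your site.
    my $dbh = DBI->connect('dbi:Sybase:server=BIGBOX', 'user', 'pass',
                           { RaiseError => 1 });

    # Drop the index so the load doesn't pay for index maintenance per row.
    $dbh->do('DROP INDEX big_table.big_table_idx');

    # ... bulk load here, e.g. bcp or a temporary-table + stored procedure step ...

    # Rebuild the index once, after all the rows are in.
    $dbh->do('CREATE INDEX big_table_idx ON big_table (key_col)');

    $dbh->disconnect;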

    Abigail

Re: Database input speed question
by liz (Monsignor) on Jul 31, 2003 at 15:24 UTC
    If you're using MySQL, there is such a thing as:
    LOAD DATA INFILE 'file.txt' INTO TABLE tablename
    
    which allows you to bulk import records and fields from a properly formatted text-file (which you can easily create with Perl).
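
    A minimal sketch of that, assuming a made-up two-column log table and a server/driver built with LOAD DATA LOCAL INFILE enabled:

    use strict;
    use warnings;
    use DBI;

    # Write the records out as a tab-separated file first.
    open my $out, '>', '/tmp/log.tsv' or die "open: $!";
    while (my $line = <STDIN>) {
        chomp $line;
        my ($ip, $hits) = split /,/, $line;
        print $out join("\t", $ip, $hits), "\n";
    }
    close $out or die "close: $!";

    # Then hand the whole file to the server in a single statement.
    my $dbh = DBI->connect('dbi:mysql:database=weblogs', 'user', 'pass',
                           { RaiseError => 1 });
    $dbh->do(q{LOAD DATA LOCAL INFILE '/tmp/log.tsv'
               INTO TABLE log
               FIELDS TERMINATED BY '\t'});
    $dbh->disconnect;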

    Liz

Re: Database input speed question
by Corion (Patriarch) on Jul 31, 2003 at 15:24 UTC

    The simple answer is: don't use DBI. Every decent database has a specialized "loader" program that bypasses SQL completely and pumps in the raw data for each table from a data file. The various loader programs vary in syntax and features; for Oracle, have a look at SQL*Loader (sqlldr), while Sybase and MSSQL have a similar program (bcp).

    The catch is that you can't have computed fields loaded this way, as the loader will most likely not have that capability, but you can always load the raw data first and then update the fields that need (Perl) computation.
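
    A sketch of that second pass, with made-up table and column names (the "computed" column here is just an MD5 digest of another column):

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    my $dbh = DBI->connect('dbi:Oracle:orcl', 'user', 'pass', { RaiseError => 1 });

    # Fill in the column the loader couldn't compute.
    my $sel = $dbh->prepare('SELECT id, payload FROM big_table WHERE digest IS NULL');
    my $upd = $dbh->prepare('UPDATE big_table SET digest = ? WHERE id = ?');

    $sel->execute;
    while (my ($id, $payload) = $sel->fetchrow_array) {
        $upd->execute(md5_hex($payload), $id);
    }
    $dbh->disconnect;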

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

      The loader may not have the intelligence to do computed fields, but Perl does. Since you will most likely have to write a Perl script anyhow to reformat your data into a format acceptable to the bulkloader, you can take that opportunity to fill in the computed fields.
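
      For example, a tiny sketch of such a reformatting pass that derives one extra column on the way to a loader-ready tab-separated file (the input layout and the derived /24 prefix are made up):

      use strict;
      use warnings;

      # Read raw comma-separated records, add a derived column, emit TSV for the bulkloader.
      while (my $line = <STDIN>) {
          chomp $line;
          my ($user, $ip, $bytes) = split /,/, $line;    # e.g. "jdoe,10.1.2.3,1234"
          my ($subnet) = $ip =~ /^(\d+\.\d+\.\d+)\./;    # computed field: the /24 prefix
          $subnet = '' unless defined $subnet;           # be defensive about malformed lines
          print join("\t", $user, $ip, $bytes, $subnet), "\n";
      }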

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Database input speed question
by blue_cowdawg (Monsignor) on Jul 31, 2003 at 15:39 UTC

    Short answer #1: it depends

    Short answer #2: TIMTOWTDI!

    When considering database performance there is a wide range of issues to consider, some of which you may or may not have any control over if you are in a shop that stovepipes responsibility, where you have to deal with a DBA for database layout, a Data Architect for table and index layout, and a System Administrator who does the system stuff.

    Some of the issues are:

    • How the database is laid out, in terms of which "spindles" the data and indexes live on
    • How the tables are indexed and how many indexes are involved
    • I/O bandwidth of the database server
    • I'm sure I am forgetting something...

    As far as speeding up the inserts themselves goes, I have discovered over the years that the bulk import commands supplied by the database manufacturer are usually faster than inserts done via code. So if you have lots of data to insert into a table all at once, you can write your data from Perl to a file and import it with the bulk import feature of your choice.

    An example of where I used this: at a major financial firm I worked for, I used a Perl script that "walked" through our network devices, discovering neighboring devices along the way and gleaning their ARP caches. This generated 20,000 Ethernet address/IP address pairs (or more) that needed to be stored in a table for later analysis. Using normal inserts against an Oracle database took way too long, but doing a bulk import took seconds.


    Peter @ Berghold . Net

    Sieze the cow! Bite the day!

    Test the code? We don't need to test no stinkin' code! All code posted here is as is where is unless otherwise stated.

    Brewer of Belgian style Ales

Re: Database input speed question
by dga (Hermit) on Jul 31, 2003 at 17:10 UTC

    Another thing that I didn't see mentioned: if you can't use a bulk loader (computed fields, etc.), then wrap the inserts in a transaction and commit at the end. If you have to use DBI to insert lots of data this will be a big time savings, since the statement preparation only happens one time instead of once per record.

    #... script starting stuff up here
    my $dbh=DBI->connect("DBI:Pg(RaiseError=>1, AutoCommit=>0):dbname=$database");
    my $stmt="INSERT INTO table ( field1, field2 ) VALUES ( ?, ? )";
    eval {
        $dbh->rollback;
        my $sth=$dbh->prepare($stmt);
        while(<INPUT>) {
            my($field1, $field2)=split;
            $sth->execute($field1, $field2);
        }
        $dbh->commit;
    };
    if($@) {
        print STDERR "Data did not insert correctly: $@";
        $dbh->rollback;
    }

    This does 2 things: The inserts go a lot faster. The inserts go in as a group or not at all which makes cleaning up after a failure a lot easier.

    The :Pg loads up a PostgreSQL connection which I used in this example since that's the database I use most.

    The RaiseError=>1 turns failures in DBI calls into fatal errors which the eval traps and reports on.

    The AutoCommit=>0 tells DBI not to commit records until you call commit();

      I was going to suggest something similar, but there are some issues with this as well.

      First, one must make sure to still commit at regular intervals. The original poster mentioned 10M rows (I'm guessing 10 million) - you'd have to have a rather large transaction log if you're going to batch all of these rows in a single transaction.

      Second, batching may not necessarily gain you all that much - certainly with Sybase the total amount of work that the database needs to do whether you batch transactions or not is the same - the only real difference is that rows are only committed to the "real" table when the server gets the COMMIT TRAN command instead of placing each row there individually (i.e. the total amount of disk IO is the same).
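
      A sketch of that periodic-commit idea from the first point, committing every 10,000 rows instead of holding one giant transaction (the batch size is arbitrary; tune it to your transaction log):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:Pg:dbname=mydb', '', '',
                             { RaiseError => 1, AutoCommit => 0 });
      my $sth = $dbh->prepare('INSERT INTO table ( field1, field2 ) VALUES ( ?, ? )');

      my $count = 0;
      while (my $line = <STDIN>) {
          my ($field1, $field2) = split ' ', $line;
          $sth->execute($field1, $field2);
          # Commit in batches so the transaction log stays a reasonable size.
          $dbh->commit unless ++$count % 10_000;
      }
      $dbh->commit;    # pick up the final partial batch
      $dbh->disconnect;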

      Michael

        When in doubt, Benchmark

        I made up the following benchmark to see the effect of AutoCommit on inserting rows in bulk.

        use strict;
        use warnings;
        use Benchmark qw( cmpthese );
        use DBI;

        my $dbc=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>0):dbname=...');
        my $dba=DBI->connect('DBI:Pg(RaiseError=>1,AutoCommit=>1):dbname=...');

        cmpthese ( 10, {
            'ac' => sub { &inserts($dba, 'auto',   1000, 0 ) },
            'mc' => sub { &inserts($dbc, 'manual', 1000, 1 ) },
        });

        sub inserts {
            my($dbh, $table, $rows, $commit)=@_;
            my $stmt=qq[ INSERT INTO $table ( id, val ) VALUES ( ?, ? ) ];
            eval {
                $dbh->rollback if($commit);
                my $sth=$dbh->prepare($stmt);
                foreach my $row ( 1..$rows ) {
                    $sth->execute( $row, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz/' );
                }
                $dbh->commit if($commit);
            };
            if($@) {
                print STDERR "Insert Failed: $stmt: $@";
                $dbh->rollback;
                die "Program Terminated";
            }
            return;
        }

        END { $dbc->disconnect; $dba->disconnect };

        Here are some results.

        #run 1
        Benchmark: timing 10 iterations of ac, mc...
                ac: 33 wallclock secs ( 0.33 usr +  0.56 sys =  0.89 CPU) @ 11.24/s (n=10)
                mc:  3 wallclock secs ( 0.38 usr +  0.35 sys =  0.73 CPU) @ 13.70/s (n=10)
        #run 2
        Benchmark: timing 10 iterations of ac, mc...
                ac: 37 wallclock secs ( 0.41 usr +  0.81 sys =  1.22 CPU) @  8.20/s (n=10)
                mc:  4 wallclock secs ( 0.37 usr +  0.50 sys =  0.87 CPU) @ 11.49/s (n=10)
        #run 3
        Benchmark: timing 10 iterations of ac, mc...
                ac: 38 wallclock secs ( 0.48 usr +  0.60 sys =  1.08 CPU) @  9.26/s (n=10)
                mc:  4 wallclock secs ( 0.38 usr +  0.40 sys =  0.78 CPU) @ 12.82/s (n=10)

        Note that I am comparing wall clock time since the perl code has very little to do. I ran 3 runs so that a representative sample could be obtained. This is running against PostgreSQL as the backend on the local host so there is minimal communication overhead.

        Committing after each 1000 rows in this test consistently yields a roughly 10-fold increase in speed over using AutoCommit. As usual YMMV, and it will certainly vary if you use a different database engine. Also note that using the bulk data importer from a text file containing the same data takes less than 1 second to complete, while running 1 insert with 1 commit for 10000 rows takes about 3 seconds.

        The data set size in this test is only 663k of data. I estimate that a significant portion of the time difference comes from the fact that when commit returns, the database pledges that the data has been written to durable media. For the manual commits this happens 10 times, whereas for AutoCommit it occurs 10000 times. If that were the only source of variability, manual commit would be 1000 times faster instead of 10 times, so the actual writing of the data constitutes a big portion of the time and that, as mentioned, is the same for either approach.

Re: Database input speed question
by hardburn (Abbot) on Jul 31, 2003 at 15:26 UTC

    How often are you going to be inserting all this data? Is it a one-time thing? Monthly? Weekly? I doubt optimizing for speed will matter unless you need to insert that much data once an hour or so.

    Good ol' placeholders with multiple execute statements should do fine:

    my @DATA = (
        [ qw( foo  bar  baz  ) ],
        [ qw( foo1 bar1 baz1 ) ],
        . . .
    );
    my $sql = q/ INSERT INTO table ( col1, col2, col3 ) VALUES ( ?, ?, ? ) /;

    # $dbh is an existing DBI connection
    my $sth = $dbh->prepare($sql) or die . . . ;

    foreach my $row (@DATA) {
        $sth->execute( @{$row} );
    }

    $sth->finish;

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Database input speed question
by DigitalKitty (Parson) on Jul 31, 2003 at 17:51 UTC
    Hi jjhorner.

    Good question. If you are running MySQL, you can use two primary tools. From the MySQL prompt, enter:
    mysql> LOAD DATA LOCAL INFILE "filename.txt" INTO TABLE tablename

    Since MySQL keywords aren't case sensitive, you need not use the same uppercase style as I did. Example:
    mysql> LOAD DATA LOCAL INFILE "monks.txt" INTO TABLE monks

    The other tool is called 'mysqlimport', and it acts as a command line interface to LOAD DATA. With mysqlimport the filename determines the table name: the part before the .filetype must match the table name exactly, with no 'extra' characters, or MySQL will try to insert the data into the wrong table.

    It works like this:
    (system prompt) mysqlimport --local database_name datafile.txt


    Hope this helps,
    -Katie
Re: Database input speed question
by chunlou (Curate) on Jul 31, 2003 at 17:12 UTC

    In addition to what's been said, the following kind of statement could also be considered, if it makes sense for your program (multiple rows in one insert statement):

    INSERT INTO table VALUES (val11, val12), (val21, val22), (val31, val32)

    But the DB native batch facility is still the best bet especially if the insert doesn't have to be real time.

      I must point out that syntax doesn't work in all DBMS. For PostgreSQL, you'll need something a bit more complex:
      INSERT INTO table
      SELECT val11, val12
      UNION
      SELECT val21, val22
      UNION
      SELECT val31, val32
      etc.
      I don't think this will be significantly faster for large collections of data than to just use DBI with a statement handle and placeholders (but YMMV).
      --
      Jeff Boes
      Database Engineer
      Nexcerpt, Inc.
      vox 269.226.9550 ext 24
      fax 269.349.9076
       http://www.nexcerpt.com
      ...Nexcerpt...Connecting People With Expertise
        Thanks for the clarification.
Re: Database input speed question
by cleverett (Friar) on Jul 31, 2003 at 18:35 UTC
    There are valid reasons to do what you want to do from DBI, especially when you're talking a few 10M records. I've had to myself.

    If you happen to be using MySQL you can indeed do multiple inserts in one DBI operation, and it does dramatically speed things up (I get 11K inserts/sec on a lightly loaded mysql-max 3.23.49 using InnoDB on an Athlon 1800+ server with a 60GB seagate drive, transferring a 2KB wide table):

    INSERT INTO log
      (banned_account, site_account, banner_id, display_date, display_time,
       click_date, click_time, cookie_id)
    VALUES
      (1,   110, 12,  '2002-12-31', '06:32:04', '0000-00-00', '00:00:00', 1),
      (102, 78,  234, '2002-12-31', '06:32:07', '2003-12-31', '06:32:23', 2),
      (24,  154, 123, '2002-12-31', '06:32:08', '0000-00-00', '00:00:00', 3)
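
    A sketch of building such a multi-row INSERT from DBI with placeholders, batching 500 rows per statement (the table, columns and the next_row() data source are made up, and the per-statement row count is arbitrary):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=ads', 'user', 'pass',
                           { RaiseError => 1 });

    my $batch_size = 500;
    my @rows;                        # each element: [ banner_id, display_date, cookie_id ]

    while (my $row = next_row()) {   # next_row() is a stand-in for your data source
        push @rows, $row;
        flush($dbh, \@rows) if @rows == $batch_size;
    }
    flush($dbh, \@rows) if @rows;    # last partial batch

    sub flush {
        my ($dbh, $rows) = @_;
        # One statement with N sets of placeholders, executed with a flat list of values.
        my $placeholders = join ',', ('(?,?,?)') x @$rows;
        $dbh->do("INSERT INTO log (banner_id, display_date, cookie_id) VALUES $placeholders",
                 undef, map { @$_ } @$rows);
        @$rows = ();
    }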
Re: Database input speed question
by Mur (Pilgrim) on Jul 31, 2003 at 17:40 UTC
    If you are experiencing slowdowns on batch insert, one cross-DBMS solution is to drop the indexes on the table, do the inserts, and re-create the indexes. This almost always speeds things up, at the expense of giving up uniqueness (primary key) checks in some DBMS.
    --
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
     http://www.nexcerpt.com
    ...Nexcerpt...Connecting People With Expertise
Re: Database input speed question
by jjhorner (Hermit) on Jul 31, 2003 at 19:34 UTC

    This issue has plagued me all day, but I've come up with a solution that I like:

    I use a Perl script to sort the data, open(SQL,"|psql $database"), print SQL "BEGIN;\n", and then print each generated SQL statement directly to the pipe called SQL. Whenever $count % 1000 is 0, I print SQL "COMMIT;\nBEGIN;\n"; - roughly as in the sketch below.

    I'm getting about 2k entries/second vs the 129 entries/second using the DBI method.

    I admit that it isn't pretty, but it gets the job done and quickly. I have a year's worth of data (~600k entries/day) to dump into my WFA web activity database soon. This should make it go quicker.
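
    Something along these lines (a rough sketch of the approach described above; build_insert() stands in for whatever turns a data line into an INSERT statement):

    use strict;
    use warnings;

    my $database = 'wfa';
    open(SQL, "|psql $database") or die "can't start psql: $!";

    print SQL "BEGIN;\n";
    my $count = 0;
    while (my $line = <STDIN>) {
        print SQL build_insert($line);
        # start a fresh transaction every 1000 statements
        print SQL "COMMIT;\nBEGIN;\n" unless ++$count % 1000;
    }
    print SQL "COMMIT;\n";
    close SQL or die "psql exited abnormally: $!";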

    J. J. Horner 
    CISSP,CCNA,CHSS,CHP,blah,blah,blah
    
Re: Database input speed question
by JSchmitz (Canon) on Jul 31, 2003 at 19:47 UTC
    I would not use LOAD DATA INFILE; it is kind of a drag compared to the much slicker mysqlimport, which should be located in /usr/local/mysql/bin. Check that out, you won't be disappointed.

    cheers
    Jeffery

      It's not "much slicker", it's just a command-line version of the same thing! The LOAD DATA INFILE is an SQL command; the mysqlimport is a command-line interface to that SQL command. They're used in different situations. See the docs.

Re: Database input speed question
by deliria (Chaplain) on Jul 31, 2003 at 19:43 UTC

    If you are using MySQL and need to do some processing which can't be handled by a bulk importer, you could use

    INSERT DELAYED INTO table (columns) VALUES ( 'values' )

    This will cause the client to abandon the query after it is sent to the server, thereby speeding things up. If you don't need the return values, or for example an auto-increment number, this might be the way to go.
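
    A minimal sketch of that from DBI, with made-up table and column names (note that DELAYED only applies to certain table types, such as MyISAM):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=weblogs', 'user', 'pass',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare('INSERT DELAYED INTO log (host, bytes) VALUES (?, ?)');

    while (my $line = <STDIN>) {
        my ($host, $bytes) = split ' ', $line;
        # The server queues the row and returns immediately; no insert id comes back.
        $sth->execute($host, $bytes);
    }
    $dbh->disconnect;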