dwalin has asked for the wisdom of the Perl Monks concerning the following question:
#!/usr/bin/perl

use strict;
use warnings;
use 5.14.0;

use Benchmark;
use DBI;

# Usage: test.pl <columns> <rows> <iterations>
my $dbh = DBI->connect('dbi:SQLite::memory:', '', '', {
    AutoCommit => 1,
    RaiseError => 1,
    PrintError => 0,
    ChopBlanks => 1,
}) or die DBI->errstr;

# Create a table with $ARGV[0] integer columns.
$dbh->do("CREATE TABLE test_bulk_insert (\n"
    . (join ",\n", map { "field$_ int" } (1..$ARGV[0]))
    . ");\n") or die $dbh->errstr;

my $sth = $dbh->prepare("INSERT INTO test_bulk_insert (\n"
    . (join ",\n", map { "field$_" } (1..$ARGV[0]))
    . ') VALUES ('
    . (join ",\n", map { '?' } (1..$ARGV[0]))
    . ')') or die $dbh->errstr;

# One array per column, bound once up front for execute_array.
my @newdata;
push @newdata, [] for (1..$ARGV[0]);
$sth->bind_param_array($_ + 1, $newdata[$_]) for keys @newdata;

# Generate $ARGV[1] rows of fixed-width random test data.
my $data;
$data .= sprintf(("%-4d" x $ARGV[0]) . "\n",
    map { int rand 9999 } (1..$ARGV[0])) for (1..$ARGV[1]);
my $template = "A4" x $ARGV[0];

timethese($ARGV[2], {
    # 'new': parse into column arrays, insert with execute_array.
    'new' => sub {
        my $buf = $data;
        my ($curr, $prev);
        while ($buf =~ /\n/g) {
            $prev = defined $curr ? $curr : 0;
            $curr = pos $buf;
            my $_ = substr $buf, $prev, $curr - $prev;
            chomp;
            my @cols = unpack $template;
            push $_, shift @cols for (@newdata);    # 5.14 array-ref autoderef
        };
        eval {
            $dbh->do("BEGIN IMMEDIATE TRANSACTION");
            $dbh->do("DELETE FROM test_bulk_insert");
            $sth->execute_array({}, @newdata);
            $dbh->commit;
        } or die $@;
        $#{$newdata[$_]} = -1 for keys @newdata;    # empty the column arrays
    },
    # 'old': parse into row tuples, insert with one execute per row.
    'old' => sub {
        my $buf = $data;
        my ($curr, $prev, @batch);
        while ($buf =~ /\n/g) {
            $prev = defined $curr ? $curr : 0;
            $curr = pos $buf;
            my $_ = substr $buf, $prev, $curr - $prev;
            chomp;
            push @batch, [unpack $template];
        };
        eval {
            $dbh->do("BEGIN IMMEDIATE TRANSACTION");
            $dbh->do("DELETE FROM test_bulk_insert");
            $sth->execute(@$_) for @batch;
            $dbh->commit;
        } or die $@;
    },
});

$dbh->disconnect;
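The script takes three arguments: the number of columns, the number of rows of generated test data, and the number of Benchmark iterations. The run below uses 100 columns, 1000 rows and 100 iterations: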
dwalin@sol:/tmp$ perl test.pl 100 1000 100
Benchmark: timing 100 iterations of new, old...
       new: 22 wallclock secs (22.10 usr + 0.28 sys = 22.38 CPU) @ 4.47/s (n=100)
       old: 10 wallclock secs ( 9.87 usr + 0.08 sys =  9.95 CPU) @ 10.05/s (n=100)

And the cause of slowness is not my code, it's DBD::SQLite. Under Devel::NYTProf it looks most peculiar:
'old' line-by-line insert:
$sth->execute(@$_) for @batch;
# spent 10.0ms making 10000 calls to DBI::st::execute, avg 1µs/call
'new' bulk insert:
$sth->execute_array({}, @newdata);
# spent 3.17s making 10 calls to DBI::st::execute_array, avg 317ms/call
# spent 3.17s making 10 calls to DBD::_::st::execute_array, avg 317ms/call
Something in Devel::NYTProf visibly affects execution: under the profiler the 'new' code becomes ten times slower than the 'old', but the overall picture is the same - bulk insert is slower. Why is that? I tried to look inside DBD::SQLite, but I'm not advanced enough to dig deep into DBI and its drivers.
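One clue is the second NYTProf line above: the time is attributed to DBD::_::st::execute_array, which suggests the generic execute_array that DBI supplies to drivers lacking a native array interface is in play. That fallback, roughly speaking, just loops over the rows, re-binding and executing each one, so it can only add overhead on top of a plain execute() loop. A minimal sketch of that assumed behavior (illustrative only, not DBI's actual source; the sub name is made up):

# Simplified sketch of the generic execute_array fallback a driver
# inherits when it has no native bulk interface. It degenerates to one
# bind/execute round trip per row, plus bookkeeping, which is why it
# cannot beat calling execute() per row yourself.
sub generic_execute_array {
    my ($sth, $attr, @columns) = @_;    # @columns: one array ref per column
    my $rows = @{ $columns[0] // [] };
    my $executed = 0;
    for my $row (0 .. $rows - 1) {
        # re-bind every placeholder for every single row...
        $sth->bind_param($_ + 1, $columns[$_][$row]) for 0 .. $#columns;
        # ...then fall back to an ordinary per-row execute
        $executed++ if $sth->execute;
    }
    return $executed;
}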
I tested it with perl-5.14.1 under Solaris 9 x86.
Regards,
Alex.
Replies are listed 'Best First'.
Re: DBD::SQLite bulk insert woes
by danb (Friar) on Jun 18, 2011 at 19:13 UTC
by dwalin (Monk) on Jun 18, 2011 at 19:46 UTC
by Marshall (Canon) on Jun 18, 2011 at 22:32 UTC
by dwalin (Monk) on Jun 19, 2011 at 08:40 UTC
by Marshall (Canon) on Sep 12, 2012 at 07:17 UTC
Re: DBD::SQLite bulk insert woes
by Anonymous Monk on Jun 21, 2011 at 03:36 UTC
by Anonymous Monk on Jun 21, 2011 at 04:03 UTC
by dwalin (Monk) on Jun 21, 2011 at 08:30 UTC