in reply to DBD::SQLite bulk insert woes

I modified your test slightly to run it against two more databases, PostgreSQL and MySQL, in addition to SQLite. I found the same result for all of them: your "new" is slower than "old":

./execute_cmp.pl 100 1000 100
Benchmark: timing 100 iterations of SQLite_new, SQLite_old...
SQLite_new: 15 wallclock secs (15.27 usr +  0.07 sys = 15.34 CPU) @  6.52/s (n=100)
SQLite_old:  5 wallclock secs ( 5.91 usr +  0.03 sys =  5.94 CPU) @ 16.84/s (n=100)
Benchmark: timing 100 iterations of PostgreSQL_new, PostgreSQL_old...
PostgreSQL_new: 66 wallclock secs (26.55 usr +  1.75 sys = 28.30 CPU) @  3.53/s (n=100)
PostgreSQL_old: 37 wallclock secs ( 7.67 usr +  1.14 sys =  8.81 CPU) @ 11.35/s (n=100)
Benchmark: timing 100 iterations of MySQL_new, MySQL_old...
MySQL_new: 58 wallclock secs (26.45 usr +  1.43 sys = 27.88 CPU) @  3.59/s (n=100)
MySQL_old: 30 wallclock secs ( 8.09 usr +  0.85 sys =  8.94 CPU) @ 11.19/s (n=100)

Maybe it's time to check our assumptions. Is it really a given that execute_array() is supposed to be faster? If the driver provides an optimized implementation, then I would say yes. The DBI docs say "Drivers *may* provide more optimized implementations using whatever bulk operation support the database API provides." (emphasis mine). But what about these three databases/drivers, none of which appear to provide such an optimization?
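
I don't know of a clean way to ask DBI whether a driver overrides the generic method, but a rough heuristic (my own sketch, not an official DBI API; it only checks whether the driver's statement class defines its own execute_array) is:

#!/usr/bin/env perl
use strict;
use warnings;

# Heuristic sketch: peek at each driver's statement-handle package to see
# whether it defines its own execute_array or just inherits DBI's generic one.
no strict 'refs';
for my $drv (qw(SQLite Pg mysql)) {
    eval "require DBD::$drv; 1" or next;
    my $own = defined &{"DBD::${drv}::st::execute_array"} ? 'its own' : "DBI's generic";
    print "DBD::$drv uses $own execute_array\n";
}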

When execute_array() just has to call execute() behind the scenes, should it really be 3x slower? As far as I can tell, the only difference *should* be that execute_array() has to copy a large array of arrayrefs, but that doesn't seem like enough to cause that much slowdown.
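
To make that concrete, here is roughly what the non-optimized path boils down to (a simplified illustration of my own, not DBI's actual code):

# Illustration only -- NOT DBI's real implementation.  Without driver
# support, execute_array() still runs one execute() per row; the extra
# work is assembling each row from the per-column arrays (whether bound
# with bind_param_array() or passed in directly).
sub naive_execute_array {
    my ($sth, $attr, @columns) = @_;          # one arrayref per placeholder
    my $rows = @{ $columns[0] // [] };
    my $ok   = 0;
    for my $i (0 .. $rows - 1) {
        my @row = map { $_->[$i] } @columns;  # transpose columns into a row
        $ok++ if $sth->execute(@row);
    }
    return $ok;                               # rows that executed successfully
}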

I think more investigation is necessary.

Here is your test with the modifications I made:

#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;

use Benchmark;
use DBI;

# Usage: execute_cmp.pl <columns> <rows> <iterations>
my @databases = (
    { name => 'SQLite',     dsn => 'dbi:SQLite::memory:' },
    { name => 'PostgreSQL', dsn => 'dbi:Pg:dbname=test_execute' },
    {
        name => 'MySQL',
        dsn  => 'dbi:mysql:test_execute',
        user => 'test_execute',
        pass => 'password',
    },
);

for my $database (@databases) {
    my $dbh = DBI->connect(
        $database->{dsn}, $database->{user}, $database->{pass},
        { AutoCommit => 1, RaiseError => 1, PrintError => 0, ChopBlanks => 1 }
    ) or die DBI->errstr;

    $dbh->do("CREATE TABLE test_bulk_insert (\n"
        . (join ",\n", map {"field$_ int"} (1 .. $ARGV[0]))
        . ");\n") or die $dbh->errstr;

    my $sth = $dbh->prepare("INSERT INTO test_bulk_insert (\n"
        . (join ",\n", map {"field$_"} (1 .. $ARGV[0]))
        . ') VALUES ('
        . (join ",\n", map {'?'} (1 .. $ARGV[0]))
        . ')') or die $dbh->errstr;

    # one arrayref per column, bound once for execute_array()
    my @newdata;
    push @newdata, [] for (1 .. $ARGV[0]);
    $sth->bind_param_array($_ + 1, $newdata[$_]) for keys @newdata;

    # generate the fixed-width test data: $ARGV[1] rows of $ARGV[0] ints
    my $data;
    $data .= sprintf(
        ("%-4d" x $ARGV[0]) . "\n",
        map { int rand 9999 } (1 .. $ARGV[0])
    ) for (1 .. $ARGV[1]);

    my $template = "A4" x $ARGV[0];

    timethese($ARGV[2], {
        "$database->{name}_new" => sub {
            my $buf = $data;
            my ($curr, $prev);
            while ($buf =~ /\n/g) {
                $prev = defined $curr ? $curr : 0;
                $curr = pos $buf;
                my $_ = substr $buf, $prev, $curr - $prev;
                chomp;
                my @cols = unpack $template;
                push $_, shift @cols for (@newdata);
            }
            eval {
                $dbh->begin_work();
                $dbh->do("DELETE FROM test_bulk_insert");
                $sth->execute_array({}, @newdata);
                $dbh->commit;
            } or die $@;
            # empty the column arrays for the next iteration
            $#{ $newdata[$_] } = -1 for keys @newdata;
        },
        "$database->{name}_old" => sub {
            my $buf = $data;
            my ($curr, $prev, @batch);
            while ($buf =~ /\n/g) {
                $prev = defined $curr ? $curr : 0;
                $curr = pos $buf;
                my $_ = substr $buf, $prev, $curr - $prev;
                chomp;
                push @batch, [unpack $template];
            }
            eval {
                $dbh->begin_work();
                $dbh->do("DELETE FROM test_bulk_insert");
                $sth->execute(@$_) for @batch;
                $dbh->commit;
            } or die $@;
        },
    });

    $dbh->disconnect;
}

--Daniel

Re^2: DBD::SQLite bulk insert woes
by dwalin (Monk) on Jun 18, 2011 at 19:46 UTC
    Well, to be frank, I didn't really know what to expect. I just tried something new (to me, at least) in the hope that it would be faster, only to find that it wasn't. I don't have a clue why passing a bunch of references and running supposedly the same DBI::st::execute in supposedly the same loop ends up twice as slow.

    If you can give me any tips on improving DBD::SQLite bulk insert performance by other means, I'd really appreciate it. With any other SQL engine I would go for SQL optimizations, but SQLite doesn't really give many options for that.

    Regards,
    Alex.

      I also don't know why there is a difference in the benchmarks. After converting the code to run on Perl 5.10 on my machine (push on an array reference isn't available until 5.14, and keys @array needs 5.12), I got similar results.
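
      The substitutions are the obvious ones, roughly like this (a sketch from memory, not necessarily the exact edit I benchmarked):

          # 5.10-compatible spellings of the 5.12/5.14 constructs in the test
          $sth->bind_param_array($_ + 1, $newdata[$_]) for 0 .. $#newdata;   # was: for keys @newdata
          push @{$_}, shift @cols for @newdata;                              # was: push $_, ...
          $#{ $newdata[$_] } = -1 for 0 .. $#newdata;                        # was: for keys @newdata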

      If you are using a real disk drive and the database is not "online", then
          $dbh->do('PRAGMA synchronous = 0');
      cut your 'old' benchmark from 47 secs down to 32 secs on my older machine. The SQLite app that I'm working on now takes about that long to create a 300K-row table of mostly varchars, so I think you are in the "zone of reason" in terms of performance. Since you are mainly using ints, it might be interesting to rerun your benchmark with numeric values instead of text.
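
      Putting the pragma together with a single transaction looks roughly like this (an untested sketch; the file name and table are made up, and synchronous = 0 trades crash safety for speed, so only use it on a database you can afford to rebuild):

          #!/usr/bin/env perl
          use strict;
          use warnings;
          use DBI;

          # Sketch only: placeholder path and table name.
          my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/bulk_test.db', '', '',
              { AutoCommit => 1, RaiseError => 1, PrintError => 0 });

          # skip the fsync per commit; fine for rebuildable data, risky otherwise
          $dbh->do('PRAGMA synchronous = 0');

          $dbh->do('CREATE TABLE IF NOT EXISTS t (a int, b int)');
          my $sth = $dbh->prepare('INSERT INTO t (a, b) VALUES (?, ?)');

          # one transaction around the whole batch, as in the 'old' benchmark
          $dbh->begin_work;
          $sth->execute(int rand 9999, int rand 9999) for 1 .. 10_000;
          $dbh->commit;

          $dbh->disconnect;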

      There is a bulk import facility for SQLite that works with a CSV format. I have never used it, but it might be quicker. I don't know where your actual data comes from.
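
      If it's the sqlite3 command-line shell's .import that I'm thinking of, driving it from Perl would look roughly like this (an untested sketch; paths and the table name are placeholders, and .import does no real CSV quoting):

          # pipe dot-commands into the sqlite3 shell; names are placeholders
          open my $sqlite, '|-', 'sqlite3', '/tmp/bulk_test.db'
              or die "cannot run sqlite3: $!";
          print {$sqlite} qq{.separator ","\n};
          print {$sqlite} qq{.import /tmp/data.csv test_bulk_insert\n};
          close $sqlite or die "sqlite3 exited with status $?";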

        It turned out to be even more peculiar:
        dwalin@sol:~$ /tmp/perl-5.8.9/bin/perl /tmp/test.pl 100 1000 100
        Benchmark: timing 100 iterations of new, old...
               new: 25 wallclock secs (24.35 usr +  0.15 sys = 24.50 CPU) @  4.08/s (n=100)
               old:  9 wallclock secs ( 8.99 usr +  0.06 sys =  9.05 CPU) @ 11.05/s (n=100)
        
        dwalin@sol:~$ /tmp/perl-5.10.1/bin/perl /tmp/test.pl 100 1000 100
        Benchmark: timing 100 iterations of new, old...
               new: 34 wallclock secs (33.95 usr +  0.17 sys = 34.12 CPU) @  2.93/s (n=100)
               old: 12 wallclock secs (11.28 usr +  0.04 sys = 11.32 CPU) @  8.83/s (n=100)
        
        dwalin@sol:~$ /tmp/perl-5.14.1/bin/perl /tmp/test.pl 100 1000 100
        Benchmark: timing 100 iterations of new, old...
               new: 23 wallclock secs (22.74 usr +  0.09 sys = 22.83 CPU) @  4.38/s (n=100)
               old: 10 wallclock secs (10.60 usr +  0.03 sys = 10.63 CPU) @  9.41/s (n=100)
        

        Curiouser and curiouser, said Alice... It seems that Perl has indeed improved in 5.14 over 5.10, but before that it was even faster in 5.8. Now that's a dilemma for me: opt for Modern Perl quirks and syntactic sugar I haven't been able to get used to yet, or stick with good ole 5.8 just for the sheer speed. 5.8.9 is 17% faster than 5.14.1 on the 'old' routine, and that's nothing to sneeze at in my situation: I need to squeeze every reasonable percent out of the DB insertion code. That's one tough decision.

        And I'm afraid not, I can't use SQLite's bulk import facility. The real-world application I'm working on is significantly more complicated than the test I'm running here. There are several tables with mixed types of data; INSERT statements are intertwined with DELETEs and SELECTs, and all of it is wrapped in BEGIN IMMEDIATE TRANSACTION/COMMIT because of concurrency. I tried the external-application route, but it sucked: sqlite had to parse a lot of similar statements as if they were different, and that brought it to its knees.

        Regards,
        Alex.