menolly has asked for the wisdom of the Perl Monks concerning the following question:

The product: database-driven, web-interface, fairly complex

Currently: Much data is mirrored between clusters ("drones"), but clients have a home drone where they are always directed at login. The idea when the mirroring was implemented was that we would be able to easily change a client's home drone, with minimal data transfer. (Well, there's a bit more to it, but it's not relevant.) We're not doing that, and we need to shrink the database due to resource constraints.

The task: Delete data belonging to foreign clients (clients not based on the drone where the script is running)

Now, if clientID were the key for (or even in) all the relevant tables, this would be easy. It's not. I'm deleting from four tables. clientID is in testDB (unique key testDigest), testDigest is in testDBRepCatTarget and testDBToMCQDB, and testDBToMCQDB contains repCatID, which is the key in table testDBRepCats.
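Roughly, the chain of keys looks like this (a sketch of the relevant columns only, not the full schema; the aliases are mine):

select t.clientID, t.testDigest, m.repCatID
  from testDB t, testDBToMCQDB m, testDBRepCatTarget rt, testDBRepCats r
 where m.testDigest = t.testDigest
   and rt.testDigest = t.testDigest
   and r.repCatID = m.repCatID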

Currently, I'm using selectcol_arrayref() to get a list of the testDigest values which should be deleted, chunking it, and using execute_array() to do the deletes.

The problem is, it takes too long. I've got time-out code so no single run will take more than 3 hours (or so), as requested, but it has to be run enough times that the total time needed is just too long.

I'm not sure if the problem is my approach (too clever? are these DBI functions too inefficient?), my SQL (would a DELETE with a complex join be faster?), or simply the amount of data (several hundred thousand records in a single table alone). Any tips or ideas for speeding this up would be appreciated. I've included the code below.

#!/usr/bin/perl -w
use strict;
use vars qw( $VERSION $DEBUG $lockFile $counter );

BEGIN { unshift(@INC,'/opt/software/current/lib/perl'); }

$VERSION = '1';
$DEBUG   = 1;

use appLib;   # internal module, used for logging, etc.
use DBI;

$| = 1;
appLib::toolInit();

my $dbh = DBI->connect("dbi:mysql:db","user","password")
    or die "failed: $DBI::errstr";

my $drone = $ENV{ROLE};
$drone =~ s/db\d+$//;

# get test digests belonging to foreign clients, stuff in array
my $all_digests = $dbh->selectcol_arrayref(
    "select testDigest from testDB t, clients c
      where t.clientID = c.clientID
        and c.drone != '$drone'
        and t.testType != 'Administrative'
        and t.clientID != 'edutest'");
unless (scalar @$all_digests > 0) { die "No foreign tests to process!"; }
$DEBUG && print "all digests: ".scalar @$all_digests."\n";

my ($avg, $delta) = (0, 0);
my $stop  = 60*60*3;   # 60s/m * 60m/h * 3h
my $now   = time;
my $start = time;
my $i     = 0;
my $chunk = 1000;
my (@status, @digests);
my ($count, $sth, $done);

# work through the digest list $chunk at a time until we run out of time
do {
    my $first = $i;
    my $last  = ($i += $chunk) < scalar @$all_digests ? $i : scalar @$all_digests;
    @digests  = @{$all_digests}[$first..($last-1)];
    $count    = 0;

    # del from affected tables
    my $digest_str = join "','", @digests;
    my $repCats = $dbh->selectcol_arrayref(
        "select repCatID from testDBToMCQDB where testDigest in ('$digest_str')");

    $sth  = $dbh->prepare("delete from testDBRepCats where repCatID = ?");
    $done = $sth->execute_array({ ArrayTupleStatus => \@status }, $repCats);
    grep {$count++ if $_} @status;
    print "$count / $done digests processed for testDBRepCats\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDBToMCQDB where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep {$count++ if $_} @status;
    print "$count / $done digests processed for testDBToMCQDB\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDBRepCatTarget where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep {$count++ if $_} @status;
    print "$count / $done digests processed for testDBRepCatTarget\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDB where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep {$count++ if $_} @status;
    print "$count / $done digests processed for testDB\n";

    $delta = time - $now;
    $avg  += $delta;
    $avg  /= 2 if $avg > $delta;
    $now   = time;
    $DEBUG && print "delta: $delta\tavg: $avg\ttotal: ".($now-$start)
                   ."\tdigests: ".scalar @digests."\n";
} while ((scalar @digests) == $chunk && ($now-$start) < ($stop-$avg));
# we haven't exceeded max time and we've probably got time for another iteration

Update: I've added alter table $TABLE disable keys and alter table $TABLE enable keys at the start and end of the script. disable keys is much faster than dropping the indices, and accomplishes the goal of not updating the indices during the massive delete. Rather than re-enabling and re-disabling the keys every time the script hits the time limit, I added an extra level of looping, a prompt to determine whether the user wishes to continue or needs to stop and use the machine for other tasks, and a flag to allow it to run unattended.
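In DBI terms, the additions are just a couple of do() calls wrapped around the existing delete loop; roughly this (a sketch, not the exact code, with the four tables listed above):

for my $table (qw( testDBRepCats testDBToMCQDB testDBRepCatTarget testDB )) {
    $dbh->do("alter table $table disable keys");
}
# ... the chunked deletes, inside the new continue-or-stop loop ...
for my $table (qw( testDBRepCats testDBToMCQDB testDBRepCatTarget testDB )) {
    $dbh->do("alter table $table enable keys");
}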

Replies are listed 'Best First'.
Re: Slow script: DBI, MySQL, or just too much data?
by dragonchild (Archbishop) on Apr 15, 2005 at 17:29 UTC
    Which $sth statement is taking the longest? What do your execution plans look like? The bottleneck is going to be the database, but the problem is almost always in your code. (Sometimes, indices can help with SELECT statements, but your DELETE statements will drag if you have too many.)
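    For instance, you can get the plan for the initial SELECT straight from the mysql client with something like this (the drone value is a placeholder):
    explain select testDigest
      from testDB t, clients c
     where t.clientID = c.clientID
       and c.drone != 'somedrone'
       and t.testType != 'Administrative'
       and t.clientID != 'edutest';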

    One possibility is to drop all indices that aren't related to your SELECT statements, do your DELETEs, then reindex the tables when you're done. Oh - OPTIMIZE and ANALYZE before and after can help out, too.
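    A rough sketch of that sequence (the index and column names here are placeholders, not your actual schema):
    analyze table testDBRepCats;
    alter table testDBRepCats drop index some_secondary_index;
    -- ... run the DELETEs ...
    alter table testDBRepCats add index some_secondary_index (someColumn);
    optimize table testDBRepCats;
    analyze table testDBRepCats;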

      OPTIMIZE before is the problem -- one reason we want to shrink the database is that we're having problems with routine OPTIMIZEs failing to complete in a reasonable time.
        How big are these tables?? I've never had an optimize take more than 10 minutes, even on 5 million rows of 20+ columns each with several multi-column indices. Also, are they InnoDB or MyISAM? That's going to make a difference (especially in the time for DELETEs if you have a lot of FK's ...)
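        If you're not sure which engine a table uses, something like show table status like 'testDB'; will tell you -- the Type column (Engine in newer versions) reports MyISAM or InnoDB.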
Re: Slow script: DBI, MySQL, or just too much data?
by naChoZ (Curate) on Apr 15, 2005 at 17:32 UTC

    AFAIK (which isn't a vast amount...), the SELECT you're doing already *is* doing a join, but on the SQL parser side internal to mysql.

    Anytime you find yourself selecting something just so you can feed it back to the db as parameters, it's a good bet you're making perl do the work that should be done by the db. So, yes, you'd probably be much better off if you used a more complex DELETE statement.

    --
    "This alcoholism thing, I think it's just clever propaganda produced by people who want you to buy more bottled water." -- pedestrianwolf

      Yes, there are some simple joins already, but I'd have to replace, for example:
      delete from testDBRepCats where repCatID = ?
      with something like:
      delete r.* from testDBRepCats r, testDBToMCQDB m, testDB t, clients c
       where r.repCatID = m.repCatID and m.testDigest = t.testDigest
         and t.clientID = c.clientID and c.drone != '$drone'
      (not checked for validity)
      except that's much more complex to chunk (for time-out purposes), because just adding limit 1000 won't maintain data integrity the way the current approach does -- I'm processing 1000 tests at a time, but all the data for those tests is removed before I check time and move to the next chunk, so I'm left with something like:
      delete r.* from testDBRepCats r, testDBToMCQDB m, testDB t
       where r.repCatID = m.repCatID and m.testDigest = t.testDigest
         and t.testDigest = ?
      passed to execute_array(), which may or may not be any gain over the current approach, since it's a more complex statement, involving a lot more tables.
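      If I went that route, the chunked version would presumably end up as something like this (again, not checked for validity):
      $sth = $dbh->prepare(
          "delete r.* from testDBRepCats r, testDBToMCQDB m, testDB t
            where r.repCatID = m.repCatID
              and m.testDigest = t.testDigest
              and t.testDigest = ?");
      $done = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);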
Re: Slow script: DBI, MySQL, or just too much data?
by hubb0r (Pilgrim) on Apr 15, 2005 at 18:21 UTC
    Well, probably the easiest way to at least see WHICH SQL statements are taking the longest, along with other profiles of your db calls, is to turn on profiling:
    $self->{'dbh'}->{Profile} = "DBI::ProfileDumper";
    This will dump a dbi.prof file in the working dir that can be read with dbiprof to give you some good information as to exactly what is happening.
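    In other words, something along these lines (a sketch; the connect is the same as in your script):
    my $dbh = DBI->connect("dbi:mysql:db", "user", "password")
        or die "failed: $DBI::errstr";
    $dbh->{Profile} = "DBI::ProfileDumper";   # profile data ends up in dbi.prof
    # ... run the deletes as usual ...
    $dbh->disconnect;
    # then, from the shell, run dbiprof to see the most expensive statements first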
Re: Slow script: DBI, MySQL, or just too much data?
by gam3 (Curate) on Apr 15, 2005 at 17:41 UTC
    You might try this; it will do your delete in one step.
    $sth = $dbh->prepare(
        "delete from testDBRepCats where repCatID IN ("
        . join(', ', map({'?'} @$repCats)) . ")");
    $done = $sth->execute(@$repCats);
    But it sounds to me like a database problem. Are you using transactions? If so, try turning off AutoCommit and do one commit at the end. Also, you should look at the indexes you have. Having too few (or too many) indexes can slow things down.

    It is hard to say more without seeing the schema.

    -- gam3
    A picture is worth a thousand words, but takes 200K.
      It's MySQL 4.0.22. No transactions.
Re: Slow script: DBI, MySQL, or just too much data?
by eXile (Priest) on Apr 15, 2005 at 17:50 UTC
    For too-slow scripts I like to use Devel::DProf to profile my code, and for DBI-related stuff there is also 'dbiprof'/DBI::ProfileData, to see some stats on the specific queries you run. hth
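    The basic workflow (script name here is a placeholder) is:
    perl -d:DProf delete_foreign_clients.pl    # writes tmon.out
    dprofpp tmon.out                           # reports where the time went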
Re: Slow script: DBI, MySQL, or just too much data?
by DamianKaelGreen (Acolyte) on Jan 29, 2014 at 15:53 UTC
    I had a similar problem and was able to solve it by disabling AutoCommit:
    my $dbh = DBI->connect("DBI:mysql:database=$database;host=localhost", $uname, $pswd,
                           {'RaiseError' => 1, AutoCommit => 0});
    and then only committing changes after I had executed many, many execute statements...
    my $max_rows_commit = 0;
    my $store = $dbh->prepare('INSERT ... ');
    foreach my $row (@rows) {
        $store->execute($row);
        $max_rows_commit++;
        if ($max_rows_commit > 10000) {
            $dbh->commit;
            $max_rows_commit = 0;
        }
    }
    $dbh->commit;
    A better reference for how to speed up DBI commands can be found here: http://www.perlmonks.org/bare/?node_id=273952 Good luck!