menolly has asked for the wisdom of the Perl Monks concerning the following question:
The product: database-driven, web-interface, fairly complex
Currently: Much data is mirrored between clusters ("drones"), but clients have a home drone where they are always directed at login. The idea when the mirroring was implemented was that we would be able to easily change a client's home drone, with minimal data transfer. (Well, there's a bit more to it, but it's not relevant.) We're not doing that, and we need to shrink the database due to resource constraints.
The task: Delete data belonging to foreign clients (clients not based on the drone where the script is running)
Now, if clientID were the key for (or even in) all the relevant tables, this would be easy. It's not. I'm deleting from four tables. clientID is in testDB (unique key testDigest), testDigest is in testDBRepCatTarget and testDBToMCQDB, and testDBToMCQDB contains repCatID, which is the key in table testDBRepCats.
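To make the chain of keys concrete, here is a hedged SELECT sketching how the affected rows relate (the 'thisdrone' value is a placeholder for the local drone name; LEFT JOINs are assumed since not every test may have MCQ rows):

```sql
-- Trace the key chain: clientID -> testDigest -> repCatID.
-- Rows returned here are the candidates for deletion.
SELECT t.testDigest, m.repCatID
FROM testDB t
JOIN clients c           ON t.clientID   = c.clientID
LEFT JOIN testDBToMCQDB m ON m.testDigest = t.testDigest
WHERE c.drone != 'thisdrone';
```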
Currently, I'm using selectcol_arrayref() to get a list of testDigest values which should be deleted, chunking it, and using execute_array() to delete.
The problem is, it takes too long. I've got time-out code so no single run will take more than 3 hours (or so), as requested, but it has to be run enough times that the total time needed is just too long.
I'm not sure if the problem is my approach (too clever? are these DBI functions too inefficient?), my SQL (would a DELETE with a complex join be faster?), or simply the amount of data (several hundred thousand records in one table alone). Any tips or ideas for speeding this up would be appreciated. I've included the code below.
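For comparison, the row-at-a-time execute_array() approach could in principle be collapsed into a single MySQL multi-table DELETE. This is only a sketch against the schema as described above (the 'thisdrone' value is a placeholder, and the LEFT JOINs assume not every test has rows in the MCQ tables); it has not been tested against the real data:

```sql
-- Delete from all four tables in one statement, following the
-- key chain testDB -> testDBToMCQDB -> testDBRepCats and
-- testDB -> testDBRepCatTarget.
DELETE t, m, rc, rct
FROM testDB t
JOIN clients c                   ON t.clientID    = c.clientID
LEFT JOIN testDBToMCQDB m        ON m.testDigest  = t.testDigest
LEFT JOIN testDBRepCats rc       ON rc.repCatID   = m.repCatID
LEFT JOIN testDBRepCatTarget rct ON rct.testDigest = t.testDigest
WHERE c.drone    != 'thisdrone'
  AND t.testType != 'Administrative'
  AND t.clientID != 'edutest';
```

Whether this is actually faster depends on the indexes available for the joins, and a single huge DELETE loses the chunking/time-out behavior of the script.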
#!/usr/bin/perl -w
use strict;
use vars qw( $VERSION $DEBUG $lockFile $counter );

BEGIN { unshift(@INC,'/opt/software/current/lib/perl'); }

$VERSION = '1';
$DEBUG   = 1;

use appLib;   # internal module, used for logging, etc.
use DBI;

$| = 1;

appLib::toolInit();

my $dbh = DBI->connect("dbi:mysql:db","user","password")
    or die "failed: $DBI::errstr";

my $drone = $ENV{ROLE};
$drone =~ s/db\d+$//;

# get test digests, stuff in array
my $all_digests = $dbh->selectcol_arrayref(
    "select testDigest from testDB t, clients c
     where t.clientID = c.clientID and c.drone != '$drone'
     and t.testType != 'Administrative' and t.clientID != 'edutest'");
unless (scalar @$all_digests > 0) { die "No foreign tests to process!"; }
$DEBUG && print "all digests: ".scalar @$all_digests."\n";

my ($avg, $delta);
my $stop  = 60*60*3;   # 60s/m * 60m/h * 3h
my $now   = time;
my $start = time;
my $i     = 0;
my $chunk = 1000;
my (@status, @digests);
my ($count, $sth, $done);

do {
    my $first = $i;
    my $last  = ($i += $chunk) < scalar @$all_digests ? $i
                                                      : scalar @$all_digests;
    @digests  = @{$all_digests}[$first..($last-1)];
    $count    = 0;

    # delete from affected tables
    my $digest_str = join "','", @digests;
    my $repCats = $dbh->selectcol_arrayref(
        "select repCatID from testDBToMCQDB
         where testDigest in ('$digest_str')");

    $sth  = $dbh->prepare("delete from testDBRepCats where repCatID = ?");
    $done = $sth->execute_array({ ArrayTupleStatus => \@status }, $repCats);
    grep { $count++ if $_ } @status;
    print "$count / $done digests processed for testDBRepCats\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDBToMCQDB where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep { $count++ if $_ } @status;
    print "$count / $done digests processed for testDBToMCQDB\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDBRepCatTarget where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep { $count++ if $_ } @status;
    print "$count / $done digests processed for testDBRepCatTarget\n";

    $count = 0;
    $sth   = $dbh->prepare("delete from testDB where testDigest = ?");
    $done  = $sth->execute_array({ ArrayTupleStatus => \@status }, \@digests);
    grep { $count++ if $_ } @status;
    print "$count / $done digests processed for testDB\n";

    $delta = time - $now;
    $avg  += $delta;
    $avg  /= 2 if $avg > $delta;
    $now   = time;
    $DEBUG && print "delta: $delta\tavg: $avg\ttotal: ".($now-$start)
                    ."\tdigests: ".scalar @digests."\n";

# keep going while we haven't exceeded max time and we've probably
# got time for another iteration
} while ((scalar @digests) == $chunk && ($now-$start) < ($stop-$avg));
Update: I've added alter table $TABLE disable keys and alter table $TABLE enable keys to the start and end of the script. disable keys is much faster than dropping the indices, and accomplishes the goal of not updating them during the massive delete. Rather than re-add and re-drop every time the script hits the time limit, I added an extra level of looping, a prompt to determine whether the user wishes to continue or needs to stop and use the machine for other tasks, and a flag to allow it to run unattended.
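For reference, the bracketing statements look like this (one pair per affected table; note that in MySQL, DISABLE KEYS only suspends maintenance of non-unique indexes, and on ENABLE KEYS the suspended indexes are rebuilt in bulk):

```sql
ALTER TABLE testDB DISABLE KEYS;
-- ... run the chunked deletes ...
ALTER TABLE testDB ENABLE KEYS;
```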