comment on

The product: database-driven, web-interface, fairly complex

Currently: Much data is mirrored between clusters ("drones"), but clients have a home drone where they are always directed at login. The idea when the mirroring was implemented was that we would be able to easily change a client's home drone, with minimal data transfer. (Well, there's a bit more to it, but it's not relevant.) We're not doing that, and we need to shrink the database due to resource constraints.

The task: Delete data belonging to foreign clients (clients not based on the drone where the script is running)

Now, if clientID were the key for (or even in) all the relevant tables, this would be easy. It's not. I'm deleting from four tables. clientID is in testDB (unique key testDigest), testDigest is in testDBRepCatTarget and testDBToMCQDB, and testDBToMCQDB contains repCatID, which is the key in table testDBRepCats.

Currently, I'm using selectcol_arrayref() to get a list of aDigest values which should be deleted, chunking it, and using execute_array() to delete.

The problem is, it takes too long. I've got time-out code so no single run will take more than 3 hours (or so), as requested, but it has to be run enough times that the total time needed is just too long.

I'm not sure if the problem is my approach (too clever? are these DBI functions too inefficient?), my SQL (would a DELETE with a complex join be faster?), or simply the amount of data (several hundred thousand records in a alone). Any tips or ideas for speeding this up would be appreciated. I've included the code below.

#!/usr/bin/perl -w
use strict;
use vars qw(
    $VERSION
    $DEBUG
    $lockFile
    $counter
    );

BEGIN {
   unshift(@INC,'/opt/software/current/lib/perl');
}

$VERSION = '1';
$DEBUG=1;

use appLib; # internal module, used for logging, etc.
use DBI;

$| = 1;
appLib::toolInit();
my $dbh = DBI->connect("dbi:mysql:db","user","password") or die "faile
+d: $DBI::errstr";


my $drone = $ENV{ROLE};
$drone =~ s/db\d+$//;

# get test digests
# stuff in array

my $all_digests = $dbh->selectcol_arrayref("select testDigest from tes
+tDB t, clients c where t.clientID=c.clientID and c.drone != '$drone' 
+and t.testType != 'Administrative' and t.clientID !='edutest'");

unless (scalar @$all_digests > 0) { die "No foreign tests to process!"
+; }
$DEBUG && print "all digests: ".scalar @$all_digests."\n";

my ($avg, $delta);
my $stop = 60*60*3; # 60s/m * 60m/h * 3h
my $now = time;
my $start = time;
my $i = 0;
my $chunk = 1000 
my (@status, @digests);
my ($count, $sth, $done);

do
{
        my $first = $i;
        my $last = ($i+=$chunk) < scalar @$all_digests ? $i : scalar @
+$all_digests;
        @digests = @{$all_digests}[$first..($last-1)];
        $count = 0;
        # del from affected tables
        my $digest_str = join "','", @digests;
        my $repCats = $dbh->selectcol_arrayref("select repCatID from t
+estDBToMCQDB where testDigest in ('$digest_str')");
        $sth = $dbh->prepare("delete from testDBRepCats where repCatID
+ = ?");
        $done = $sth->execute_array({ ArrayTupleStatus => \@status },$
+repCats);
        grep {$count++ if $_} @status;
        print "$count / $done digests processed for testDBRepCats\n";

        $count = 0;
        $sth = $dbh->prepare("delete from testDBToMCQDB where testDige
+st = ?");
        $done = $sth->execute_array({ ArrayTupleStatus => \@status },\
+@digests);
        grep {$count++ if $_} @status;
        print "$count / $done digests processed for testDBToMCQDB\n";

        $count = 0;
        $sth = $dbh->prepare("delete from testDBRepCatTarget where tes
+tDigest = ?");
        $done = $sth->execute_array({ ArrayTupleStatus => \@status },\
+@digests);
        grep {$count++ if $_} @status;
        print "$count / $done digests processed for testDBRepCatTarget
+\n";

        $count = 0;
        $sth = $dbh->prepare("delete from testDB where testDigest = ?"
+);
        $done = $sth->execute_array({ ArrayTupleStatus => \@status },\
+@digests);
        grep {$count++ if $_} @status;
        print "$count / $done digests processed for testDB\n";

        $delta = time - $now;
        $avg += $delta;
        $avg /= 2 if $avg > $delta;
        $now = time;
        $DEBUG && print "delta: $delta\tavg: $avg\ttotal: ".($now-$sta
+rt)."\tdigests: ".scalar @digests."\n";
}
while ((scalar @digests) == $chunk && ($now-$start) < ($stop-$avg));  
+# we haven't exceeded max time and we've probably
                                                                      
+# got time for another iteration
[download]

Update: I've added alter table $TABLE disable keys and alter table $TABLE enable keys to the start and end of the script. disable keys is much faster than dropping the indices, and accomplishes the goal of not updating the indices while doing the massive delete. Rather than re-add and re-drop everytime the script hits the time limit, I added an extra level of looping, a prompt to determine whether the user wishes to continue or needs to stop and use the machine for other tasks, and a flag to allow it to run unattended.

In reply to Slow script: DBI, MySQL, or just too much data? by menolly

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.