3dbc has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks,

I am continually writing Perl to amalgamate databases. Specifically, I write code that fetches some verbose data set and compares it against various other data sets, in order to update and/or insert the fetched data into those other data sets. My question is: can I replace the rudimentary record-by-record SQL UPDATE and INSERT statements with a tied hash (e.g. DB_File), Data::Dumper, or possibly some other module, in order to speed up this correlation and reallocation? A constraint worth considering is that the databases being written to are multi-user environments and cannot be locked because they are perpetually in use.

Thanks,
my $href1 = $dbh->selectall_hashref($sql1, "key");
my $href2 = $dbh2->selectall_hashref($sql2, "key2");
my $href3 = $dbh3->selectall_hashref($sql3, "key3");
foreach (sort keys %$href1) {
    my $sql;
    if (exists $href2->{$_}) {
        $sql = "UPDATE data_set2 ";
        $sql .= "SET data = $href1->{$_}{data} WHERE key2 = $_";
    }
    else {
        $sql = "INSERT INTO data_set2 (key2, data) VALUES ($_, $href1->{$_}{data})";
    }
    $dbh2->do($sql);

    if (exists $href3->{$_}) {
        $sql = "UPDATE data_set3 ";
        $sql .= "SET data = $href1->{$_}{data} WHERE key3 = $_";
    }
    else {
        $sql = "INSERT INTO data_set3 (key3, data) VALUES ($_, $href1->{$_}{data})";
    }
    $dbh3->do($sql);
}

Re: PERL DB Optimization
by Tuppence (Pilgrim) on Dec 18, 2003 at 23:43 UTC

    Before I offer my thoughts on your question I must first comment on one issue I see in your code, namely the lack of bind params. Apologies if you already know this, but...

    SQL queries should follow this form:

    my $sth = $dbh->prepare('UPDATE data_set2 SET data = ? WHERE key2 = ?');
    $sth->execute($val1, $key);
    This gives you several benefits: proper escaping of your data, cacheability of statement handles (i.e. using the same statement handle for multiple updates, saving time because the DB server does not have to re-parse the SQL), and nobody complaining at you to use bind params ;)
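
    For instance, here is a minimal sketch of your update loop rewritten with prepared handles and placeholders. The handles, hashrefs, and table/column names are the hypothetical ones from your post:

    use DBI;

    # Prepare once; the server parses each statement a single time.
    my $upd = $dbh2->prepare('UPDATE data_set2 SET data = ? WHERE key2 = ?');
    my $ins = $dbh2->prepare('INSERT INTO data_set2 (key2, data) VALUES (?, ?)');

    foreach my $key (sort keys %$href1) {
        my $data = $href1->{$key}{data};
        if (exists $href2->{$key}) {
            $upd->execute($data, $key);   # values are quoted/escaped for you
        }
        else {
            $ins->execute($key, $data);
        }
    }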

    Now, on to your question.

    I have done several different variations of what you are trying to do, and the highest performance I have been able to get out of the process comes from putting more intelligence in the SQL and less in the Perl. If you can make the database do more of the work, the Perl has less data to chew through and will therefore run faster. If your database does not support subselects and joins, this may be harder than it otherwise would be.

    For instance, your example looks like two problems:
    1. creating new records for keys that do not yet exist
    2. updating records that already exist

    The first can be handled by asking the database for a list of only those records that don't yet exist in the destination table, i.e.

    SELECT id_field, field_to_update
    FROM table_1
    WHERE id_field NOT IN (SELECT join_id_field FROM table_2)
    and the second can be handled with slightly more complicated SQL, like this:
    SELECT src.id_field, src.field_to_update
    FROM table_1 src, table_2 dest
    WHERE src.id_field = dest.join_id_field
      AND src.field_to_update != dest.field_to_be_updated
    This will get you a list of the records that need to be updated.
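
    To make that concrete, here is a minimal Perl sketch that wires the two queries above into insert and update passes with bind params. The table and column names are the made-up ones from the queries, and $dbh is assumed to reach both tables:

    # Pass 1: insert source records that are missing from the destination.
    my $missing = $dbh->prepare(
        'SELECT id_field, field_to_update FROM table_1
         WHERE id_field NOT IN (SELECT join_id_field FROM table_2)'
    );
    my $ins = $dbh->prepare(
        'INSERT INTO table_2 (join_id_field, field_to_be_updated) VALUES (?, ?)'
    );
    $missing->execute;
    while (my ($id, $val) = $missing->fetchrow_array) {
        $ins->execute($id, $val);
    }

    # Pass 2: update destination records whose values differ.
    my $stale = $dbh->prepare(
        'SELECT src.id_field, src.field_to_update
         FROM table_1 src, table_2 dest
         WHERE src.id_field = dest.join_id_field
           AND src.field_to_update != dest.field_to_be_updated'
    );
    my $upd = $dbh->prepare(
        'UPDATE table_2 SET field_to_be_updated = ? WHERE join_id_field = ?'
    );
    $stale->execute;
    while (my ($id, $val) = $stale->fetchrow_array) {
        $upd->execute($val, $id);
    }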

    To take this a step further, if your database supports it you can even do the updates purely on the database side, although that query is much more difficult and I do not have time or inclination to figure it out for an example problem :)
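
    The general shape, on a database whose dialect allows correlated subqueries in UPDATE, would be something like this (an untested sketch using the made-up names from above):

    $dbh->do(q{
        UPDATE table_2
        SET field_to_be_updated = (
            SELECT src.field_to_update
            FROM table_1 src
            WHERE src.id_field = table_2.join_id_field
        )
        WHERE join_id_field IN (SELECT id_field FROM table_1)
    });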

    Hope that helps

Re: PERL DB Optimization
by Zaxo (Archbishop) on Dec 18, 2003 at 21:47 UTC
    ... the databases being written to are multi-user environments and cannot be locked because they are perpetually in use.

    Mmm, that sounds like locking is particularly necessary. Perhaps your RDBMS offers record-level locking.

    It is hard to answer your question without knowing what these things are built on. You are sucking unspecified, possibly huge, hunks of data into Perl hashes. When memory gets tight, VM swapping is going to wreck performance. You might get better performance from stored procedures, a big redesign of the databases, or slaving some tables, but without specifics it's hard to say what you need to do. Being less ambitious about memory use may be enough to speed things up for you.
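
    For instance, one way to keep memory flat is to stream source rows instead of slurping whole tables with selectall_hashref. A sketch, assuming the $dbh and $sql1 from the original post (compare_and_apply is a hypothetical stand-in for the per-record work):

    my $src = $dbh->prepare($sql1);
    $src->execute;
    while (my $row = $src->fetchrow_hashref) {
        # Only one source row is held in memory at a time, instead of
        # the entire result set living in a hash.
        compare_and_apply($row);
    }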

    After Compline,
    Zaxo