seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all

I've profiled a script of mine only to find that the biggest culprit in slowing down the script is the function below.

As you can see, I have several tables, each of which is connected, in 'descending' order, by a unique ID to the table above it.

So every model has the correct atom id, every atom has the correct residue id, and every residue has the correct chain id.

Every line I parse from a file has all the data for all four tables, so I can fill all four tables in one function. The problem is that I want to be able to NOT query for any of the unique IDs.

But I have to write the application so that it can resume progress if it crashes or the computer needs rebooting, etc. If it goes over any of the same files, the data is already there in the database. So how can I speed up the whole function? Every call costs 0.000637 seconds, which is huge given the sheer volume of calls:

%Time    Sec.     #calls   sec/call  F  name
32.16 20144.1603  31635687   0.000637     DBI::st::execute
here's the function:

sub addAtom {
    my $self = shift;
    my $pdb  = shift;
    my $atom = shift;
    my $ch   = $atom->chainId();
    my $dbh  = DBI->connect($dbi, $u, $p, { 'RaiseError' => 1 });
    my $cid  = $self->getCID($pdb, $atom->chainId(), $dbh);
    my ($rid, $aid, $mid, $res, $ato);
    if (!$cid) {
        $dbh->do("INSERT INTO chain (pdb,chain) VALUES ('$pdb','$ch')");
        $cid = $self->getCID($pdb, $atom->chainId(), $dbh);
    }
    if ($cid) {
        $rid = $self->getRID($cid, $atom->resNumber, $atom->resName, $dbh);
        if (!$rid) {
            $res = $dbh->quote($atom->resName());
            $dbh->do("INSERT INTO residue (cid,rnumber,rname) VALUES ('$cid','"
                . $atom->resNumber . "',$res)");
            $rid = $self->getRID($cid, $atom->resNumber, $atom->resName, $dbh);
        }
        if ($rid) {
            $aid = $self->getAID($rid, $atom->atomName, $dbh);
            if (!$aid) {
                $ato = $dbh->quote($atom->atomName());
                $dbh->do("INSERT INTO atom (rid,aname) VALUES ('$rid',$ato)");
                $aid = $self->getAID($rid, $atom->atomName, $dbh);
            }
            if ($aid) {
                $mid = $self->getMID($aid, $atom->model, $dbh);
                if (!$mid) {
                    $dbh->do("INSERT INTO model (aid,model,x,y,z) VALUES ('$aid','"
                        . $atom->model . "','" . $atom->x . "','"
                        . $atom->y . "','" . $atom->z . "')");
                    $mid = $self->getMID($aid, $atom->model, $dbh);
                }
            }
        }
    }
    $dbh->disconnect();
}
Cheers
Sam

UPDATE:

Thanks for the replies, I had actually broached the subject of avoiding repeated connections in this thread.

I think the best solution to my problem is simply to use stored procedures, and also to keep a hash of the 'current' ids.

I have one more question though: is it possible to get the Perl DBI or MySQL to return the newly created ID, so that I don't have to re-query for it?

thanks
Sam

Replies are listed 'Best First'.
Re: Avoiding too many DB queries
by shemp (Deacon) on Jun 17, 2004 at 16:22 UTC
    You are probably losing A LOT more time connecting to and disconnecting from the database on each call to your method. You should connect to the database once and store the handle somewhere for reuse. This appears to be an object method, so why not store the DB handle in the object?
    Also, since you're repeatedly doing effectively the same type of queries, you may want to prepare the standard queries once with bind parameters, so the SQL engine doesn't need to re-parse the queries each time.
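    A minimal sketch of both suggestions together (the method names `dbi_handle` and `prepared` are illustrative, not from the original code): keep one connection and a cache of prepared statement handles in the object, and use placeholders instead of interpolated values.

```perl
use strict;
use warnings;
use DBI;

# DSN and credentials as in the original post (placeholder values here).
our ($dbi, $u, $p) = ('dbi:mysql:pdbdb', 'user', 'pass');

sub dbi_handle {
    my $self = shift;
    # Connect once and keep the handle in the object for reuse.
    return $self->{dbh} ||= DBI->connect($dbi, $u, $p, { RaiseError => 1 });
}

sub prepared {
    my ($self, $sql) = @_;
    # Cache statement handles by SQL text: each query is parsed only once,
    # and the placeholders (?) are filled in safely at execute time.
    return $self->{sth}{$sql} ||= $self->dbi_handle->prepare($sql);
}

# Inside addAtom(), a lookup then becomes e.g.:
#   my $sth = $self->prepared("SELECT cid FROM chain WHERE pdb = ? AND chain = ?");
#   $sth->execute($pdb, $ch);
#   my ($cid) = $sth->fetchrow_array;
```

    With this in place, addAtom() neither connects nor disconnects itself; the handle lives as long as the object does.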
Re: Avoiding too many DB queries
by iburrell (Chaplain) on Jun 17, 2004 at 16:26 UTC
    The big optimization I see is to not make a new connection to the database for each call. I am assuming you are calling this method many times in a loop. Make the connection once, process the loop, and then disconnect.

    Another optimization is to cache the results if you do the same lookups repeatedly. For example, keep a hash that maps atom name to aid. If the mapping is not found, call the database, else use the value.
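    That caching idea can be sketched in plain Perl. The `db_lookup_aid` sub below is a hypothetical stand-in for the real database query (the actual code would call getAID() via DBI); the point is that repeated lookups of the same name hit the database only once.

```perl
use strict;
use warnings;

my %aid_cache;    # atom name -> aid, filled on first lookup
my $db_calls = 0;

# Hypothetical stand-in for the real DBI query in getAID().
sub db_lookup_aid {
    my ($name) = @_;
    $db_calls++;
    my %table = (CA => 1, CB => 2, N => 3);
    return $table{$name};
}

sub cached_aid {
    my ($name) = @_;
    # Hit the database only when this name has not been seen before.
    $aid_cache{$name} = db_lookup_aid($name)
        unless exists $aid_cache{$name};
    return $aid_cache{$name};
}

cached_aid('CA') for 1 .. 1000;   # 1000 lookups...
print "db calls: $db_calls\n";    # ...but only one database call
```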

    Which database are you using? With some databases, it can make sense to rely on the database to check the keys and report an error when a row already exists. Your code then rolls back the transaction and goes on to the next item.

Re: Avoiding too many DB queries
by EvdB (Deacon) on Jun 17, 2004 at 17:04 UTC
    The comments above suggest that you cut out the repeated database connections. This code might be a good place for you to start:
    sub dbi_connect {
        my $self = shift;
        return $$self{dbh} ||= DBI->connect($dbi, $u, $p, { 'RaiseError' => 1 });
    }

    sub dbi_disconnect {
        my $self = shift;
        return 1 unless $$self{dbh};
        my $dbh = $self->dbi_connect;
        $dbh->disconnect || die $dbh->errstr;
        $$self{dbh} = undef;
        return 1;
    }
    Obviously you should only call dbi_disconnect when you are finished with the db handle. Alternatively just let it get disconnected automatically when the object goes out of scope.

    --tidiness is the memory loss of environmental mnemonics

Re: Avoiding too many DB queries
by shemp (Deacon) on Jun 17, 2004 at 18:26 UTC
    In regard to your update question about re-querying for the newly created ids: if the id fields are AUTO_INCREMENT, which I think they must be from the context of your question, use the query 'SELECT LAST_INSERT_ID()'.

    It's still querying the DB, but with almost no overhead, because MySQL stores the most recently created AUTO_INCREMENT value for each connection. So after each insert, just run that query instead of whatever your getRID(), getAID(), etc. methods do.
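    DBI also exposes this without a separate query. A sketch, assuming chain.cid is an AUTO_INCREMENT column and $dbh is an open handle to MySQL (the `insert_chain` sub name is illustrative):

```perl
use strict;
use warnings;

sub insert_chain {
    my ($dbh, $pdb, $ch) = @_;
    $dbh->do("INSERT INTO chain (pdb,chain) VALUES (?,?)", undef, $pdb, $ch);

    # MySQL remembers the last AUTO_INCREMENT value per connection, so
    # either of these avoids re-querying for the row just inserted:
    my $cid = $dbh->{mysql_insertid};                              # DBD::mysql attribute
    # my $cid = $dbh->last_insert_id(undef, undef, 'chain', 'cid'); # portable, DBI 1.38+
    return $cid;
}
```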
Re: Avoiding too many DB queries
by Plankton (Vicar) on Jun 17, 2004 at 16:33 UTC
    Does the database you are using have stored procedures? If so use a stored procedure to do this.

    Plankton: 1% Evil, 99% Hot Gas.
Re: Avoiding too many DB queries
by jeffa (Bishop) on Jun 18, 2004 at 13:05 UTC

    Just a nitpick. Don't do this:

    $dbh->do("INSERT INTO chain (pdb,chain) VALUES ('$pdb','$ch')");
    Do this instead:
    $dbh->do("INSERT INTO chain(pdb,chain) VALUES(?,?)",undef,$pdb,$ch);
    and let DBI do the quoting for you.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Will this actually reduce any overhead?

      I've got too many statements all over my code to go through and update it as you recommend...

      Cheers
      Sam

        It is more of a security issue. You no longer have to worry about what is in those variables, because DBI is going to make sure they are safe and not trying to break anything. It also means you can prepare the statement beforehand using placeholders and then execute it multiple times with different values; in that case I believe it DOES reduce the overhead.
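        The prepare-once, execute-many pattern looks like this (a sketch; $dbh is assumed to be an open DBI handle, and the atoms are represented here as plain hashrefs):

```perl
use strict;
use warnings;

sub insert_atoms {
    my ($dbh, @atoms) = @_;
    # Parsed by the SQL engine a single time...
    my $sth = $dbh->prepare("INSERT INTO atom (rid,aname) VALUES (?,?)");
    for my $atom (@atoms) {
        # ...then executed repeatedly with different bound values,
        # which DBI quotes safely on each execute.
        $sth->execute($atom->{rid}, $atom->{aname});
    }
}
```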


        ___________
        Eric Hodges