Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I have a question on how to reduce the time taken to run a script that involves three queries when run individually on Oracle.
1. A query to get all the samples takes about "Elapsed: 00:00:00.30"
2. A query to get the snps takes about "Elapsed: 00:00:00.7"
3. A query to get the genotypes for each sample and each snp takes about "Elapsed: 00:00:00.1" (for a single sample and a single snp)

My general structure of the code is something like this:
foreach my $ind ( @{$indsent} ) {    # $indsent is the list of samples
    my $i = 1;
    foreach my $snp ( @{$snparray} ) {    # list of snps
        my $genotype = $conn->executeArrRef('snp::Sequenom::getGenotypeCalls', $ind->[1], $snp->[1]);
        foreach my $geno ( @{$genotype} ) {
            if ( defined $geno->[9] ) {
                print "$i,$ind->[0],$geno->[9],$snp->[2],$snp->[0]\n";
            }
        }
    }
}
Could someone suggest an alternative way of doing this? I also tried dumping the snparray and indsent into a hash and passing the values to fetch the genotype, but both approaches are equally bad :)
I feel bad saying it, but it takes nearly an hour just to fetch the data from the DB, even though the individual queries are not bad. (If it is of any help: I will have 4K samples and 30-90 snps.)

Thanks for your time and suggestion in advance :)

Replies are listed 'Best First'.
Re: How to increase the efficiency of the code when working with DB?
by roboticus (Chancellor) on May 26, 2011 at 12:34 UTC

    You don't show which driver you're using, nor the queries you're sending, nor how you're fetching the data. So I don't see how I can offer any suggestions for improving your database access.

    However, you do have three nested loops. Perhaps you're fetching the same data repeatedly? If so, you might get better performance by caching the data and fetching it only once.
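    For illustration only, a minimal caching sketch along those lines, reusing the wrapper call from the original post (the helper name and cache key are made up, and it only helps if the same sample/snp pair is actually requested more than once):

    # Cache genotype results per (sample, snp) pair so a repeated lookup
    # hits the hash instead of the database (sketch, assumes Perl 5.10+).
    my %geno_cache;

    sub get_genotype_cached {
        my ($conn, $ind_id, $snp_id) = @_;
        my $key = "$ind_id:$snp_id";
        $geno_cache{$key} //= $conn->executeArrRef(
            'snp::Sequenom::getGenotypeCalls', $ind_id, $snp_id,
        );
        return $geno_cache{$key};
    }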

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Well, I thought the query and the database were not the concern in this instance; I think (maybe I am wrong) the issue is the way I go through the loops (foreach sample and foreach snp).
      but here it is:
      $conn = XXXXX::Database::Conn->new('live');    # a wrapper built around the actual DBI module to make fetching data easier
      $conn->addConnection(
          DBI->connect('dbi:Oracle:SNP.world', 'snp', 'snp',
                       { RaiseError => 1, AutoCommit => 0 }),
          'snp'
      );

      My queries
      -- samples
      select distinct i.id_ind
      from dna_plate p, dnaplate_status ps, dna_well w, individual_name i
      where p.id_dnaplate = w.id_dnaplate
        and w.id_content = i.id_ind
        and p.plate_type = 13
        and ps.id_dnaplate = p.id_dnaplate
        and ps.status = 7
        and p.id_study in (6637, 7867)
        and i.name_type = 2
      order by i.id_ind

      -- snps
      select ss.id_snp
      from well_assay wa, snpassay_snp sas, snp_summary ss
      where wa.id_assay = sas.id_assay
        and sas.id_snp = ss.id_snp
        and wa.id_well in ('W30467', 'W30569', 'W30572')
      order by 1

      -- genotypes (called once per sample and snp)
      select distinct 'C' as proc_status,
             nvl(o.abbreviation, substr(translate(o.orgtitle, ' ', '_'), 1, 15)) || ':' || nvl(inm.name, 'NOEXTNAME') as suppname,
             i.id_ind, i.clonename, g.id_snp, wa.id_well as plex,
             nvl(dp.id_dnaplate, 0) as id_dnaplate,
             nvl(upper(rm.maprow), '-') || nvl(trim(to_char(rm.mapcol, '09')), '-') as wellrc,
             nvl(to_char(g.id_dnawell), 'UNKNOWN') as id_dnawell,
             g.genotype, g.confidence,
             nvl(dp.plate_name, 'NOPLATENAME') as plate_name
      from (individual i left outer join individual_name inm
              on i.id_ind = inm.id_ind and inm.name_type = 2),
           (genotype g left outer join
              ((dna_well dw join dna_plate dp on dw.id_dnaplate = dp.id_dnaplate)
                join rowcol_map rm on dp.format = rm.format and dw.id_map = rm.id_map)
              on g.id_dnawell = dw.id_dnawell),
           organisation o, well_assay wa
      where nvl(i.supplier, 0) = o.id_org
        and i.id_ind = g.id_ind
        and g.id_assay = wa.id_assay
        and i.id_ind = ?
        and g.id_snp = ?
        and g.disregard = 0
      order by 2, 3, 4, 5, 6
Re: How to increase the efficiency of the code when working with DB?
by MidLifeXis (Monsignor) on May 26, 2011 at 13:15 UTC

    Is there a way that you can let the database handle the loops? Perhaps rewriting two or three of your queries as joined queries, and returning the data all at once would allow the DBMS to optimize the query a bit.
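    As a sketch only (untested, with the table and column names lifted from the queries posted above, and assuming a plain DBI handle in $dbh rather than the OP's wrapper), the sample and SNP selections could become subqueries of the genotype query so that Oracle pairs them in a single round trip:

    # One query instead of samples x snps queries; column list trimmed
    # to the essentials for the sketch.
    my $rows = $dbh->selectall_arrayref(q{
        select g.id_ind, g.id_snp, g.genotype, g.confidence
          from genotype g
         where g.disregard = 0
           and g.id_ind in (
                 select distinct i.id_ind
                   from dna_plate p, dnaplate_status ps, dna_well w, individual_name i
                  where p.id_dnaplate = w.id_dnaplate
                    and w.id_content = i.id_ind
                    and p.plate_type = 13
                    and ps.id_dnaplate = p.id_dnaplate
                    and ps.status = 7
                    and p.id_study in (6637, 7867)
                    and i.name_type = 2
               )
           and g.id_snp in (
                 select ss.id_snp
                   from well_assay wa, snpassay_snp sas, snp_summary ss
                  where wa.id_assay = sas.id_assay
                    and sas.id_snp = ss.id_snp
                    and wa.id_well in ('W30467', 'W30569', 'W30572')
               )
    });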

    --MidLifeXis

Re: How to increase the efficiency of the code when working with DB?
by anonymized user 468275 (Curate) on May 26, 2011 at 13:07 UTC
    My first reaction is to try GROUP BY instead of DISTINCT - it gives many RDBMSs the chance to use a faster internal algorithm. Also, it doesn't look like you're doing this, but just in case: avoid reconnecting to the database between queries - maintain a single DBI database handle unless you are doing parallel processing. Consider using multiple database handles and forks (updated: see also Parallel::ForkManager) if there is an opportunity for parallel processing, to save elapsed time, especially if there are many such iterations to parallelise.
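    A minimal sketch of the fork approach, assuming one child per sample and reusing the connect string from the original post; every child must open its own DBI handle, because handles cannot be shared across forks:

    use DBI;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);    # 4 parallel workers; tune to taste

    for my $ind (@{$indsent}) {
        $pm->start and next;                   # parent: move on to the next sample

        # child: private connection, do the per-sample work, then exit
        my $dbh = DBI->connect('dbi:Oracle:SNP.world', 'snp', 'snp',
                               { RaiseError => 1, AutoCommit => 0 });
        # ... fetch and print the genotypes for $ind here ...
        $dbh->disconnect;

        $pm->finish;
    }
    $pm->wait_all_children;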

    One world, one people

Re: How to increase the efficiency of the code when working with DB?
by tospo (Hermit) on May 26, 2011 at 15:02 UTC
    How long does it take to run those queries in a shell, i.e. not through your Perl module? I think the issue is likely to be one of the database itself and not one of your client script. First and foremost I would check indices of the relevant tables in that schema. If the query can not be run efficiently because no suitable index can be found then it's going to take forever.
      1. A query to get all the samples takes about "Elapsed: 00:00:00.30"
      2. A query to get the snps takes about "Elapsed: 00:00:00.7"
      3. A query to get the genotypes for each sample and each snp takes about "Elapsed: 00:00:00.1" (for a single sample and a single snp)
        That doesn't look too bad, but of course you have to do the last one almost 400,000 times, so that will take a while. I haven't looked at your queries too closely, but is there a reason why you need to issue a query for genotypes with a single sample and a single snp instead of querying for all samples IN (...) and all snps IN (...) in one go? That should be a lot faster. You can associate sample, genotype and snp in your Perl script by parsing the results. Other than that I would still look at the indices, which might not be optimal for your queries, and also at the sorting you let the DB do - that too will take time and might not be necessary, depending on how you use the data afterwards.
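        A rough sketch of that idea with plain DBI, assuming the sample and snp ids are already collected in @sample_ids and @snp_ids and that $dbh is a plain DBI handle (those names are made up for illustration):

        # Build placeholders for the IN lists, run one query for every
        # sample/snp combination, then key the calls by sample and snp.
        my $ind_ph = join ',', ('?') x @sample_ids;
        my $snp_ph = join ',', ('?') x @snp_ids;

        my $sth = $dbh->prepare(qq{
            select g.id_ind, g.id_snp, g.genotype, g.confidence
              from genotype g
             where g.disregard = 0
               and g.id_ind in ($ind_ph)
               and g.id_snp in ($snp_ph)
        });
        $sth->execute(@sample_ids, @snp_ids);

        my %calls;
        while (my ($ind, $snp, $genotype, $confidence) = $sth->fetchrow_array) {
            $calls{$ind}{$snp} = [ $genotype, $confidence ];
        }

        (Note that Oracle limits an IN list to 1000 entries, so with 4K samples the sample list would need to be chunked or pushed back into a subquery.)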
Re: How to increase the efficiency of the code when working with DB?
by Anonymous Monk on May 27, 2011 at 10:56 UTC
    Don't run 40,000 queries. Run one query that returns 40,000 rows, even if you have to loop repeatedly through those rows reading the same result-set again and again and again.