lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I recently loaded some data into a database. It worked fine
but it would be slow if I had a larger number
of records to load.

I am only working with four string types (TUP, TDP, TDF, TUF)
that exist in a db table. How can I look them up ahead of time and push
them into a hash (TUP => 1, TDP => 2, TDF => 3, TUF => 4)?
That way I could just look up each id in the hash instead of querying the table for every record,
which is what I am presently doing with my prepared statement $sth2.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use DateTime;

    ########################################################
    # This script creates new records in the oracle        #
    # database for Taqman oligos. These are                #
    # revised versions of oligos that previously failed    #
    # and now the sequence and oligo_name are ready        #
    # to be inserted into the oligos table.                #
    ########################################################

    my $now        = DateTime->now(time_zone => 'floating');
    my $yesterday  = $now->subtract(days => 1);
    my $year       = $yesterday->year;
    my $month      = $yesterday->month;
    my $month_abbr = uc($yesterday->month_abbr);   # uppercase month abbreviation
    my $day        = $yesterday->day;
    my $year_abbr  = substr "$year", 2, 2;

    # creation date for loading into the database date fields
    my $create_date = "$day-$month_abbr-$year_abbr";

    my $dbh = DBI->connect("DBI:Oracle:jweb", "genes", "biggenes")
        or die "Couldn't connect to database: " . DBI->errstr;

    open(my $oligo_file, '<', "irFRP_Oligos.txt")
        or die "can't open file: $!";

    while (<$oligo_file>) {
        chomp;
        my ($oligo, $seq) = (split /\t/)[0, 1];

        # get project name from substring of $oligo
        # Create records for irregular oligos. Example, 12453TUR2.
        my $project_name = substr($oligo, 0, 5);

        my $sth1 = $dbh->prepare(
            "select project_id from projects where project_name = ?");
        $sth1->execute($project_name);
        my ($project_id) = $sth1->fetchrow_array;

        # this is where I want to use a hash lookup instead of a query
        my $oligos_type_name = substr($oligo, -4, -1);   # e.g. "TUR" from "12453TUR2"

        my $sth2 = $dbh->prepare(
            "select oligos_type_id from oligos_types where oligos_type_name = ?");
        $sth2->execute($oligos_type_name);
        my ($oligos_type_id) = $sth2->fetchrow_array;

        my $sth3 = $dbh->prepare("select oligos_sequence.nextval from dual");
        $sth3->execute;
        my ($oligo_id) = $sth3->fetchrow_array;

        my $sth = $dbh->prepare(
            "insert into oligos (oligo_id, project_id, oligo_type_id,
                                 oligo_sequence, oligo_name,
                                 oligo_create_user_id, oligo_modify_user_id,
                                 oligo_create_date, oligo_modify_date,
                                 oligo_disabled_flag)
             values (?,?,?,?,?,?,?,?,?,?)");
        my $rv = $sth->execute($oligo_id, $project_id, $oligos_type_id, $seq,
                               $oligo, '52', '52', $create_date, $create_date, '0');
    }
    close $oligo_file;
Your wisdom is greatly appreciated! LomSpace

Replies are listed 'Best First'.
Re: How do I improve the efficiency of my script
by dreadpiratepeter (Priest) on Jul 21, 2009 at 01:56 UTC
    For starters, I would move the prepare statements out of the loop; you can execute the same prepared statement multiple times. That alone ought to get you a major performance increase right away.
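    For example, a minimal sketch of that change, reusing the queries from the original post (it assumes $dbh and $oligo_file are set up as in the script above):

        # prepare each statement once, before the loop
        my $sth_proj = $dbh->prepare(
            "select project_id from projects where project_name = ?");
        my $sth_type = $dbh->prepare(
            "select oligos_type_id from oligos_types where oligos_type_name = ?");

        while (<$oligo_file>) {
            chomp;
            my ($oligo, $seq) = (split /\t/)[0, 1];

            # re-execute the already-prepared handles with fresh bind values
            $sth_proj->execute(substr($oligo, 0, 5));
            my ($project_id) = $sth_proj->fetchrow_array;

            $sth_type->execute(substr($oligo, -4, -1));
            my ($oligos_type_id) = $sth_type->fetchrow_array;

            # the sequence fetch and the insert would use handles
            # prepared outside the loop in exactly the same way
        }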


    -pete
    "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: How do I improve the efficiency of my script
by graff (Chancellor) on Jul 21, 2009 at 05:38 UTC
    The first reply is very important -- definitely prepare each statement once, and use it multiple times.

    Apart from that, if you don't have a really vast quantity of distinct "$project_name" or "$oligos_type_id" values, you might consider loading all of these into a hash before you start reading the data file:

        my $proj_id_sth = $dbh->prepare(
            "select project_name, project_id from projects");
        $proj_id_sth->execute;
        my %proj_name_id;
        while ( my $row = $proj_id_sth->fetchrow_arrayref ) {
            my ( $pr_name, $pr_id ) = @$row;
            $proj_name_id{$pr_name} = $pr_id;
        }
    and similarly for the "oligos_type_name / oligos_type_id" mapping (update: that is, using a separate but similar %oligos_name_id hash). That way, you execute each of those queries just once before reading the file, and then as you go ahead and read the file, you are just using each "name" value as a hash key to look up the "id" value for your insert.
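    A minimal sketch of that second mapping, plus how the two hashes might replace the per-record queries inside the read loop (handle and variable names here are illustrative):

        my $type_id_sth = $dbh->prepare(
            "select oligos_type_name, oligos_type_id from oligos_types");
        $type_id_sth->execute;
        my %oligos_name_id;
        while ( my ( $type_name, $type_id ) = $type_id_sth->fetchrow_array ) {
            $oligos_name_id{$type_name} = $type_id;
        }

        # later, inside the file-reading loop: plain hash lookups, no queries
        my $project_id     = $proj_name_id{ substr($oligo, 0, 5) };
        my $oligos_type_id = $oligos_name_id{ substr($oligo, -4, -1) };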

    One more thing to consider: instead of doing the inserts via DBI in your script, you could just write the rows of field values to a tab-delimited (or other suitable) data file, then use your DB engine's native "import" or "loader" utility (sqlldr for Oracle, mysqlimport for MySQL, etc.) to apply the inserts from that file. That will be significantly faster than doing inserts with Perl/DBI, and the difference becomes dramatic when you are looking at thousands of inserts in a single run.
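    A rough sketch of the output step (the output file name is made up; the user id and date come from the original script, and the hashes are the ones built above):

        open(my $out, '>', 'oligos_load.tsv')
            or die "can't write oligos_load.tsv: $!";
        while (<$oligo_file>) {
            chomp;
            my ($oligo, $seq) = (split /\t/)[0, 1];
            # one tab-separated row per intended insert; ids come from the lookup hashes
            print $out join("\t",
                $proj_name_id{ substr($oligo, 0, 5) },
                $oligos_name_id{ substr($oligo, -4, -1) },
                $seq, $oligo, 52, 52, $create_date, $create_date, 0), "\n";
        }
        close $out;

    The oligo_id column could then be filled from the sequence during the load itself (SQL*Loader control files can apply SQL expressions, such as a sequence's nextval, to a column) rather than being fetched one row at a time.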

      Thanks for the knowledge!
Re: How do I improve the efficiency of my script
by JavaFan (Canon) on Jul 21, 2009 at 10:16 UTC
    If you really have a lot of such strings, loading them all into a hash could make you run out of core memory - and swapping isn't quite the powerbeast either.

    But if I look at your queries, you seem to do three selects and use the results purely to populate another table. That smells like no data needs to make the round trip at all: can't you just write a single insert statement that looks up the values from the other tables itself, possibly via a stored procedure?
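    A rough sketch of that idea with plain DBI, assuming the schema from the original post (the sequence name, the column names, and the use of sysdate for the date fields should all be checked against the real tables):

        # one statement per record; the database resolves both ids itself,
        # so no lookup results ever travel back to the client
        my $ins = $dbh->prepare(q{
            insert into oligos (oligo_id, project_id, oligo_type_id,
                                oligo_sequence, oligo_name,
                                oligo_create_user_id, oligo_modify_user_id,
                                oligo_create_date, oligo_modify_date,
                                oligo_disabled_flag)
            select oligos_sequence.nextval, p.project_id, t.oligos_type_id,
                   ?, ?, 52, 52, sysdate, sysdate, 0
            from projects p, oligos_types t
            where p.project_name = ? and t.oligos_type_name = ?
        });

        # inside the loop: bind sequence text, oligo name, and the two lookup keys
        $ins->execute($seq, $oligo, substr($oligo, 0, 5), substr($oligo, -4, -1));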