lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello!
I have created a small MySQL database with two tables, and I have to insert data from 69 files into them.
I am putting the file names into an array and looping over them for insertion.

For each line, I split the file's tab-separated fields, then insert one value into a field in the first table. Next, I query that table to get the primary key, which I then insert into the second table as a foreign key along with the other fields.
#!/usr/bin/perl -w
use strict;
use warnings;
use DBI;

################################################
# Load the files into the MySQL database       #
################################################

my $dbh = DBI->connect("DBI:mysql:dev:informatics", "mystuff", "gene777")
    || die "Could not connect to database:\$DBI::errstr";

# Put all of the files into an array for loading
my $count = 0;

# loop through the filehandles. treat $file as a normal
# file handle
my @files;
foreach my $file (@files) {
    open(my $file, "/home/mydir/data/test_files/$file");
    while (<$file>) {
        chomp;
        my @fields  = split /\t/;
        my $well    = $fields[0];
        my $gene    = $fields[1];
        my $lt      = $fields[2];
        my $barcode = $fields[3];
        my $name    = $fields[4];

        # Insert data into tables
        my $sth1 = $dbh->prepare("INSERT INTO dev.blocks (names) VALUES(?)");
        # $name indices file are to be loaded into the 'blocks' table 'name' field
        $sth1->execute($name);

        # query dev.blocks for id by passing in the $name
        my $sth2 = $dbh->prepare("SELECT id from dev.blocks where name = ?");
        $sth2->execute($name);

        # get the result from the blocks table and assign it to a variable.
        my @result = $sth2->fetchrow();
        my $id = $result[0];

        # Insert the variable into the foreign key block_id in the vials table
        # along with the other variables to be loaded into the vials table.
        my $sth3 = $dbh->prepare("INSERT INTO dev.vials (block_id, lt, well_position, gene, barcode) VALUES(?,?,?,?,?)");
        $sth3->execute($id, $lt, $well, $gene, $barcode);
    }
    close($ltvec_file);
    $count++;
}

Is there a better way to do this?

Replies are listed 'Best First'.
Re: Is there a more efficient way?
by moritz (Cardinal) on Jul 29, 2009 at 19:54 UTC
    A first step is to move the prepare() calls out of the loop, and reuse the statement handler they return.
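    An untested sketch of just that change applied to the code above (credentials and paths are the OP's; the file list and the RaiseError option are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:dev:informatics", "mystuff", "gene777",
                       { RaiseError => 1 })
    or die "Could not connect to database: $DBI::errstr";

my @files = ();   # populate with the 69 file names, e.g. via readdir or glob

# prepare() once, outside the loops...
my $ins_block = $dbh->prepare("INSERT INTO dev.blocks (names) VALUES (?)");
my $sel_id    = $dbh->prepare("SELECT id FROM dev.blocks WHERE name = ?");
my $ins_vial  = $dbh->prepare(
    "INSERT INTO dev.vials (block_id, lt, well_position, gene, barcode)
     VALUES (?,?,?,?,?)");

for my $file (@files) {
    open my $fh, '<', "/home/mydir/data/test_files/$file"
        or die "can't open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($well, $gene, $lt, $barcode, $name) = split /\t/, $line;

        # ...then reuse the handles on every row.
        $ins_block->execute($name);
        $sel_id->execute($name);
        my ($id) = $sel_id->fetchrow_array;
        $ins_vial->execute($id, $lt, $well, $gene, $barcode);
    }
    close $fh;
}
```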
      I tested this on a dir with 3 files and it is not inserting the
      data into the table fields. The tables are related via the $id (pk in blocks, fk in vials). Please excuse the previous typo.
        What do you want to tell me with the code you pasted? It still has the ->prepare statements inside the loop, and at a casual glance I see no difference to what you posted originally. What did you change?
        I tested this on a dir with 3 files and it is not inserting the data into the table fields.

        Was it working in your original version? At least I didn't read anything about any failures.

        Anyway, you should set the RaiseError option on connecting to the database, see the DBI documentation.

        The code you have posted above is identical to the code you posted first? Maybe it's a mistake on my part.

Re: Is there a more efficient way?
by mzedeler (Pilgrim) on Jul 29, 2009 at 21:08 UTC

    There is a one-to-many relationship between vials(block_id) and blocks(id), but your code doesn't seem to use it in any way. If there are many rows in vials for each row in blocks, you have the opportunity to optimize away many of the INSERT-followed-by-SELECT round trips on the blocks table. Storing the names (the pluralisation is really confusing here) and id pairs in a hash will definitely speed things up.

    Also, most databases with autonumbering offer a special function for retrieving the generated number, which should perform better than doing a SELECT to find the latest inserted id. This will especially help if the names column isn't indexed.
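    A small sketch of the caching idea. The inserter is passed in as a code ref so the cache logic is independent of DBI; with the OP's handles it would be something like `sub { $sth1->execute($_[0]); $dbh->{mysql_insertid} }` (the hookup is an assumption, untested against a live database):

```perl
use strict;
use warnings;

# Returns a closure that maps a block name to its id, calling the
# supplied inserter only the first time a given name is seen.
sub make_block_id_cache {
    my ($insert_and_get_id) = @_;   # called once per *new* name
    my %id_for;
    return sub {
        my ($name) = @_;
        $id_for{$name} //= $insert_and_get_id->($name);
        return $id_for{$name};
    };
}
```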

Re: Is there a more efficient way?
by roboticus (Chancellor) on Jul 30, 2009 at 01:48 UTC
    lomSpace:

    If speed is the issue, I'd use this approach:

    1. Create a surrogate temporary table with no indexes for each table you're loading.

    2. Bulk load your temporary tables. (I don't know MySQL, so I'm unfamiliar with its bulk data loader.)
    3. Insert new records from your temporary tables into your production tables.

      -- Only insert them if they don't exist
      INSERT INTO dev.blocks (names)
      SELECT names FROM dev.tmp_blocks
      WHERE names NOT IN ( SELECT names from dev.blocks );

      INSERT INTO dev.vials (block_id, lt, well_position, gene, barcode)
      SELECT block_id, lt, well_position, gene, barcode
      FROM dev.tmp_vials
      WHERE ... clause to prevent inserting duplicates ...
    4. Drop your surrogate tables.

    This way, you can let the database do the work as efficiently as possible. You'll reduce the number of transactions on the database, reduce the traffic between your server and the database server, reduce the I/O consumed by the database, etc.
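    For steps 1 and 2, MySQL's bulk loader is LOAD DATA INFILE. A rough, untested sketch (the staging-table layout mirrors the tab-separated files, its column types are guesses, and LOCAL loading must be enabled on the connection):

```perl
# Step 1: staging table with no indexes (column types are assumptions).
$dbh->do("CREATE TEMPORARY TABLE dev.tmp_raw (
            well_position VARCHAR(16), gene VARCHAR(64), lt VARCHAR(32),
            barcode VARCHAR(32), name VARCHAR(64))");

# Step 2: bulk-load each tab-separated file into the staging table.
for my $file (@files) {
    my $path = $dbh->quote("/home/mydir/data/test_files/$file");
    $dbh->do("LOAD DATA LOCAL INFILE $path INTO TABLE dev.tmp_raw
              FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
}
```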

    ...roboticus
Re: Is there a more efficient way?
by scorpio17 (Canon) on Jul 30, 2009 at 14:05 UTC

    There's always a better way. ;-)

    I'd do something like this (untested):

    my $sth1 = $dbh->prepare("INSERT INTO dev.blocks (names) VALUES(?)");
    my $sth3 = $dbh->prepare("INSERT INTO dev.vials (block_id, lt, well_position, gene, barcode) VALUES(?,?,?,?,?)");

    for my $file (@files) {
        open my $fh, '<', "/home/mydir/data/test_files/$file"
            or die "can't open $file : $!\n";
        while (my $line = <$fh>) {
            chomp $line;
            my ($well, $gene, $lt, $barcode, $name) = split(/\t/, $line);
            eval { $sth1->execute($name); };
            if ($@) { die "database error 1: ", $dbh->errstr; }
            my $id = $dbh->{mysql_insertid};
            eval { $sth3->execute($id, $lt, $well, $gene, $barcode); };
            if ($@) { die "database error 2: ", $dbh->errstr; }
        }
        close($fh);
    }

    Notes:

    • Move the $dbh->prepare statements outside of the loop.
    • Wrap the $sth->execute statements in an eval block, for error checking.
    • Assuming your 'blocks' table has an auto-incrementing primary key, the $dbh->{mysql_insertid} statement saves you from having to do a select on the thing you just inserted.

    After the line with the split statement, you should probably add something like this for each field you expect to read in:

    unless($well) { die "missing value for well in file $file, line $. \n"; }

    Note that the special variable $. contains the current line number of the file being read. You could also change the 'die' into a 'print' (to write errors into a log file, maybe), followed by a 'next' (to skip to the next line).
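    Those per-field checks can be collected into one place. A small sketch of the print-and-skip variant (the field names follow the OP's files):

```perl
use strict;
use warnings;

# Validate one tab-separated line; returns the five fields, or an empty
# list after logging a warning, so the caller can simply 'next' on failure.
sub parse_line {
    my ($line, $file, $lineno) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    my @names  = qw(well gene lt barcode name);
    for my $i (0 .. $#names) {
        unless (defined $fields[$i] && length $fields[$i]) {
            warn "missing value for $names[$i] in file $file, line $lineno\n";
            return;
        }
    }
    return @fields[0 .. $#names];
}
```

    In the read loop that would be used as: `my ($well, $gene, $lt, $barcode, $name) = parse_line($line, $file, $.) or next;`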

    The code you posted has a lot of problems. For example, you don't put anything into the @files array before looping over it - so that loop has nothing to work on. And you can't have a variable called $file that contains the file name and also use it as the file handle. And the file handle you close should be the same as the one you open, etc.