radiantmatrix has asked for the wisdom of the Perl Monks concerning the following question:

Great and Wise Monks, I humbly request your Wisdom:

I am working on a large DB project. Files are loaded, line-by-line, into a "cache" table, and each line is tagged with a Session ID (sid) corresponding to when it was created.

When the sessions are copied to the "live" tables, more recent data with the same key values must replace older data. If the data doesn't exist (the more frequent case), then it must be inserted.

I've got something that works, but it's not as fast as I'd like it to be, and the code is definitely in "ugly hack" status. I'm looking for a better, cleaner, faster way to do what I need.

Update: I have, in fact, found a better way to do this that combines dragonchild's and Roy Johnson's ideas. Thanks for the help, monks!

Currently, I read a "Map" file which defines the keys for the table, among other things, and extract a list of sessions that need updating into @session. Then I use this chunk of code:

foreach (@session) {
    $db->{_dbh}->{PrintError} = $main::opt_verbose;  # suppress errors unless verbose
    $db->prepare($select) or die("Unable to prepare SELECT");
    $db->execute($_)      or die("Unable to execute SELECT");
    my $session = $kit->session(id => $_, status => 'locked');
    $log->say("Processing $_");
    my $sth     = $db->{_sth};
    my @records = @{ $sth->fetchall_arrayref };

  SCRUB:
    my $inserted = 0;
    my (@to_update, @skipped, @upd_error, @ins_error);
    $session->update($db);

    $log->say("Inserting rows...");
    $log->indent;
    $db->prepare($insert) or die("Unable to prepare INSERT");
    while (@records) {
        my @row = @{ shift @records };
        if ($db->execute(@row)) {
            print STDERR ".";  # if ($main::opt_verbose);
            $inserted++;
        }
        else {
            print STDERR "o";  # if ($main::opt_verbose);
            push @to_update, \@row;
            push @ins_error, $db->errstr;
            chomp $ins_error[-1];
        }
    } #-while : done inserting

    # Process to_update rows
    print STDERR "\n";
    $log->unindent;
    $log->say("Inserted $inserted rows, now updating " . scalar(@to_update) . " rows.");
    $log->indent;
    $db->prepare($update);
    if (@{ $map{options}{key} }) {
        foreach (@to_update) {
            my @upd_params = @{$_}[0 .. $#{ $map{options}{key} }];
            if ($db->execute(@{$_}, @upd_params)) {
                print STDERR ".";  # if ($main::opt_verbose);
            }
            else {
                print STDERR "o";  # if ($main::opt_verbose);
                push @skipped,   $_;
                push @upd_error, $db->errstr;
                chomp $upd_error[-1];
            }
        } #-for
    }
    else {
        @skipped = @to_update;
        push @upd_error, "" for @ins_error;  # define @upd_error, to avoid later issues
    } #-if
    $log->unindent;

    # Clean up
    my $skip_count = scalar @skipped;  # capture before the drain below empties the array
    if (@skipped) {
        open SKIPF, '>>', 'session.skip'
            or $log->warn("Unable to record skipped lines to file");
        print SKIPF "\n===> $_\n";
        while (@skipped) {
            print SKIPF join(chr(9),
                shift @ins_error, shift @upd_error, @{ shift @skipped }) . "\n";
        }
        close SKIPF;
        $log->warn("$skip_count records thrown to skip file");
    }
    $log->say("Finished with $_: $inserted new rows, "
        . (@to_update - $skip_count) . " updated.");
} #-SCRUB
Note that I use some custom modules to simplify certain tasks. $db is a custom wrapper for DBI that logs DBI messages to a file (and some other stuff). $log is a custom logger that fits my employer's anal logging requirements.

Obviously, there are items here which only come into play later in the script -- I've left them in out of a concern that I might accidentally trim too much.

Essentially, I select all the records into an array-of-arrays and try to insert them. If that operation fails (DBI raises an error, which causes my custom $db->execute to return undef), I add that row to the @to_update array.

I then make a pass on the @to_update array, attempting to use the database keys to execute an UPDATE on the correct rows. Any rows that fail that operation too are added to @skipped and recorded to a SKIP file for later analysis.

Obviously, since the data is first loaded by my script and then sent back to the database (sometimes twice per row), this can be extremely slow. Stored Procedures are explicitly excluded by the project requirements. My supervisor thinks this performs acceptably, but I'm trying to learn to be an efficient coder.

The Wisdom of the Monks is requested to find a Better Way.

--
$me = rand($hacker{perl});

Replies are listed 'Best First'.
Re: Superior way to update SQL tables
by Roy Johnson (Monsignor) on Sep 08, 2004 at 14:29 UTC
    This response has nothing to do with Perl, but you should be able to process your data with one insert and one update statement. One would be of the form
    UPDATE real_table R
       SET (col1, col2, col3, ...) =
           (SELECT col1, col2, col3, ...
              FROM scratch S
             WHERE R.key_val1 = S.key_val1
               AND R.key_val2 = S.key_val2
               AND ...);
    and the other would be like
    INSERT INTO real_table
    SELECT * FROM scratch S
     WHERE NOT EXISTS (
           SELECT 1 FROM real_table
            WHERE key_val1 = S.key_val1
              AND key_val2 = S.key_val2
              AND ... );
    The snag is that it sounds like some of the inserts might fail (since you mention skipping). At least you should be able to do the UPDATE, and delete those records. Then you can loop through what's left for INSERT/SKIP.
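
    For what it's worth, here is a minimal sketch of that flow in DBI terms, reusing the names from the statements above (scratch, real_table, key_val1/key_val2, col1..col3). The SET (..) = (SELECT ..) form is Oracle/DB2-flavoured, so adjust the spelling for your engine:

        # One set-based UPDATE for rows that already exist in the live table,
        # guarded by EXISTS so unmatched live rows are left alone.
        $dbh->do(q{
            UPDATE real_table R
               SET (col1, col2, col3) =
                   (SELECT col1, col2, col3 FROM scratch S
                     WHERE R.key_val1 = S.key_val1
                       AND R.key_val2 = S.key_val2)
             WHERE EXISTS (SELECT 1 FROM scratch S
                            WHERE R.key_val1 = S.key_val1
                              AND R.key_val2 = S.key_val2)
        });

        # Clear the freshly-applied rows out of the scratch table...
        $dbh->do(q{
            DELETE FROM scratch
             WHERE EXISTS (SELECT 1 FROM real_table R
                            WHERE R.key_val1 = scratch.key_val1
                              AND R.key_val2 = scratch.key_val2)
        });

        # ...so whatever is left in scratch is insert material, which can be
        # looped over row-by-row to feed any failures to the skip file.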

    Caution: Contents may have been coded under pressure.
      I know that I am not wise in the ways of SQL (just a newbie). However, how would this let me know which records failed both INSERT and UPDATE? I need to record them (the entirety of the data).

      What am I missing?

      --
      $me = rand($hacker{perl});
        I assumed that updates would not fail. If updates can fail, too, then you're pretty much stuck going through row-by-row. You could at least determine which rows should be inserts and which should be updates, in advance, by checking whether a corresponding row exists in the real table:
        $to_insert_stmt = q{
            SELECT * FROM scratch S
             WHERE session_id = ?
               AND NOT EXISTS (
                   SELECT 1 FROM real_data
                    WHERE key_val1 = S.key_val1 AND ... )
        };
        $to_update_stmt = q{
            SELECT * FROM scratch S
             WHERE session_id = ?
               AND EXISTS (
                   SELECT 1 FROM real_data
                    WHERE key_val1 = S.key_val1 AND ... )
        };
        Then any rows that fail insert are skips, and any rows that fail update are skips, and you don't waste time trying to insert rows that should be updates.
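
        A rough sketch of how that partitioned loop might look. Everything here is hypothetical: $sid, $insert_statement, $update_statement, and addto_Skip stand in for your session id, your real statements, and your skip-file recorder, and it assumes RaiseError is on so failures land in $@:

            # Partition the session's rows up front, using the two SELECTs above.
            my $ins_rows = $dbh->selectall_arrayref($to_insert_stmt, {}, $sid);
            my $upd_rows = $dbh->selectall_arrayref($to_update_stmt, {}, $sid);

            # Each batch then gets the statement known to be the right one.
            my $sth_ins = $dbh->prepare($insert_statement);
            for my $row (@$ins_rows) {
                eval { $sth_ins->execute(@$row) };
                addto_Skip(@$row) if $@;    # failed INSERT -> skip file
            }

            my $sth_upd = $dbh->prepare($update_statement);
            for my $row (@$upd_rows) {
                eval { $sth_upd->execute(@$row) };
                addto_Skip(@$row) if $@;    # failed UPDATE -> skip file
            }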

        Caution: Contents may have been coded under pressure.
Re: Superior way to update SQL tables
by dragonchild (Archbishop) on Sep 08, 2004 at 14:30 UTC
    First off, have you correctly defined primary and unique keys? If you haven't, this is going to be extremely difficult.

    Second, what database are you using? MySQL has the REPLACE command, which was designed to do what you're asking. Otherwise, you can depend on your primary and unique keys to fail appropriately, allowing you to do something like:

    my $dbh = DBI->connect(
        $dsn, $user, $password,
        {
            PrintError => 0,
            RaiseError => 1,
        },
    ) or die $DBI::errstr;

    foreach (@whatever) {
        eval { $dbh->do( $insert_statement, {}, @values ); };
        if ($@) {
            $dbh->do( $update_statement, {}, @values );
        }
    }

    The theory is that if you insert something that isn't there yet, you're fine. If it's there, the primary and unique keys will complain, so you do an update.
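
    For the MySQL case, a minimal sketch (table and column names are placeholders): REPLACE deletes any existing row with the same primary or unique key and inserts the new one, so the whole eval dance collapses to one statement per row. Bear in mind it really is a delete plus an insert under the hood:

        # MySQL-only: DELETE-if-exists plus INSERT in a single statement.
        my $sth = $dbh->prepare(q{
            REPLACE INTO real_table (key_val1, key_val2, col1, col2)
            VALUES (?, ?, ?, ?)
        });
        $sth->execute(@$_) for @rows;    # @rows is a placeholder AoA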

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      The methodology you describe is exactly what I am doing. The idea of attempting the update from within the fail block is interesting, and something I hadn't considered. Thank you!

      I note that you use $dbh->do rather than prepare/execute. Obviously, that is easier, but are there caveats to running several million do()s as opposed to one prepare and millions of execute(@values) calls?

      Also, is there a good reason not to do:

      while (my @row = $sth->fetchrow_array) {
          eval { $dbh2->do( $insert_statement, {}, @row ); };
          if ($@) {
              eval { $dbh2->do( $update_statement, {}, @row ); };
              if ($@) {
                  addto_Skip(@row);
              }
          }
      }
      ?
      --
      $me = rand($hacker{perl});
        I used do() for illustrative purposes. I would actually recommend using prepare_cached()/execute() instead of prepare()/execute(). do() has to re-prepare the statement, which can take time.
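
        Something like this, inside the per-row loop (a sketch; the statement string is whatever you already have):

            # prepare_cached() compiles the SQL once per distinct string and
            # returns the same handle on every later call, so this is nearly
            # as cheap as hoisting the prepare out of the loop yourself.
            my $sth = $dbh->prepare_cached($insert_statement);
            $sth->execute(@row);    # no re-prepare, unlike do()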

        As for your addto_Skip(@row) thing, that's up to you. If you expect to have rows that you cannot insert or update, that would be a useful thing to have.

        You can also do something like:

        while (my @row = $sth->fetchrow_array) {
            eval { $sth_insert->execute( @row ) };
            next unless $@;
            eval { $sth_update->execute( @row ) };
            next unless $@;
            addto_Skip( @row );
        }

        Though, personally, I would look at using fetch() with bind_columns() instead of fetchrow_array(); it's the best-performing fetch option that DBI supports.
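
        A minimal sketch of that pattern, assuming three columns for illustration:

            # bind_columns() ties Perl variables to the result columns, so each
            # fetch() refreshes them in place with no per-row list copying.
            my ($key1, $key2, $val);
            $sth->execute;
            $sth->bind_columns(\$key1, \$key2, \$val);
            while ($sth->fetch) {
                eval { $sth_insert->execute($key1, $key2, $val) };
                next unless $@;
                eval { $sth_update->execute($key1, $key2, $val) };
                next unless $@;
                addto_Skip($key1, $key2, $val);
            }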

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested