lihao has asked for the wisdom of the Perl Monks concerning the following question:
Here is my scenario: I have over 200K records which I read from a CSV file and need to pre-process (the adjust_format() subroutine in the code below) before importing them into a MySQL DB.
In the table mytbl there is a unique-key constraint combining three columns, and it's possible that some input records share the same unique key (so they are duplicate entries and only one copy should be kept). What I have done so far:
# where MyApp::DBIx is a sub-class of DBIx::Simple,
# and $csvfile is an already-open filehandle for the CSV input
use Text::CSV;
use MyApp::DBIx;

my $csv = Text::CSV->new({ binary => 1 });
my $db  = MyApp::DBIx->new();
my $sql = qq[ REPLACE INTO `mytbl` VALUES (??) ];

while (my $row = $csv->getline($csvfile)) {
    adjust_format($row);
    $db->query($sql, @{$row});
}
Note: if I use INSERT instead of REPLACE in the above SQL statement, my script breaks as soon as a duplicate is found.
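One thing I have been wondering about (but have not benchmarked) is whether committing in larger batches would help. A rough, untested sketch, reusing $csv, $db, $sql and adjust_format() from above; the batch size of 1000 is an arbitrary guess:

my $batch = 1000;
my $count = 0;

$db->dbh->begin_work;                 # plain DBI transaction control on the underlying handle
while (my $row = $csv->getline($csvfile)) {
    adjust_format($row);
    $db->query($sql, @{$row});
    if (++$count % $batch == 0) {
        $db->dbh->commit;             # flush this batch
        $db->dbh->begin_work;         # start the next one
    }
}
$db->dbh->commit;                     # flush the final partial batch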
So what's the best way to do bulk insertion with unique-key constraints? My original loop above, replacing one row at a time, is too slow. I know I can process each record in Perl, print the tab-delimited columns out to a file, and then use a MySQL command like "LOAD DATA...":
LOAD DATA INFILE '/path/to/import.dat' INTO TABLE `mytbl`;
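From the Perl side, what I have in mind is roughly the following untested sketch. It reuses $csv, $db and adjust_format() from above, writes a tab-delimited file, and relies on LOAD DATA's own REPLACE keyword to overwrite rows that collide on the unique key. It also assumes LOCAL infile is allowed by both the client (e.g. mysql_local_infile in the DSN) and the server:

my $tsv = Text::CSV->new({ binary => 1, sep_char => "\t", eol => "\n" });

open my $out, '>', '/path/to/import.dat' or die "Cannot write import.dat: $!";
while (my $row = $csv->getline($csvfile)) {
    adjust_format($row);
    $tsv->print($out, $row);          # one tab-delimited line per record
}
close $out or die "Cannot close import.dat: $!";

# Note: fields containing tabs, newlines or quotes would need matching
# ENCLOSED BY / ESCAPED BY settings here.
$db->query(q[
    LOAD DATA LOCAL INFILE '/path/to/import.dat'
    REPLACE INTO TABLE `mytbl`
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
]);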
This sounds OK. My question is: are there better Perl ways to handle bulk data insertion that can take care of all of the above (the pre-processing and the duplicate keys)?
Many thanks
lihao