mkirank has asked for the wisdom of the Perl Monks concerning the following question:

Hi ,

We have a log parser that parses logfiles and inserts the data into a database using unixODBC. The databases currently supported are Postgres, Oracle and MS-SQL. The logfiles are large (each one is a few MB), and I have read articles which say that inserting data in bulk is faster than inserting one record at a time. At present the application is slow: it has to process all the logfiles one by one and insert the data. If I do the following, will it do a bulk insert?
$sth = $dbh->prepare("insert into table (foo,bar) values (?,?)");
$tuples = $sth->execute_array(\%attr, \@foo_values, \@bar_values);  # one arrayref of values per placeholder
Is there any other way to speed up the process or do a bulk insert (it should support all the databases)?
Thanks .....

Re: Parse Logfile and Bulk Insert
by Corion (Patriarch) on Jul 26, 2004 at 11:46 UTC

    Depending on your database and schema, you might want to drop the indices on the table, do the inserts, and then rebuild the indices. Also, depending on your requirements, you might want to disable DBI AutoCommit on the database handle and commit at the end of your program run, or at other convenient points during execution.
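
    Roughly like this (an untested sketch; the DSN, connection details, table/column names and get_next_record() stand in for your own connection and parser code):

        use DBI;

        # AutoCommit => 0 batches the inserts into transactions instead of
        # committing every single row.
        my $dbh = DBI->connect('dbi:ODBC:mydsn', $user, $pass,
                               { AutoCommit => 0, RaiseError => 1 });

        my $sth = $dbh->prepare('INSERT INTO log_table (foo, bar) VALUES (?, ?)');

        my $count = 0;
        while ( my ($foo, $bar) = get_next_record() ) {   # your parser goes here
            $sth->execute($foo, $bar);
            $dbh->commit unless ++$count % 10_000;        # commit every 10_000 rows
        }
        $dbh->commit;
        $dbh->disconnect;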

    In the long run, there is nothing faster than the database-supplied bulk loader: sqlldr for Oracle, bcp for MS-SQL, and COPY for Postgres. Postgres can import delimited files directly with COPY; for the others, you have to drive the respective bulk loader with the proper configuration file.
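
    From memory, the invocations look roughly like this (check each loader's manual for the exact flags; the file, table and login names are placeholders):

        # Oracle
        sqlldr userid=user/pass@mydb control=log_table.ctl data=logdata.csv
        # MS-SQL
        bcp mydb.dbo.log_table in logdata.csv -c -t, -S myserver -U user -P pass
        # Postgres (from psql)
        psql -d mydb -c "\copy log_table from 'logdata.csv' with delimiter ','"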

Re: Parse Logfile and Bulk Insert
by gaal (Parson) on Jul 26, 2004 at 11:48 UTC
    MySQL has its own bulk insert syntax; other databases may have something similar, but of course you lose portability if you depend on that.
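
    For instance, MySQL's LOAD DATA INFILE issued through DBI (a sketch; the file and table names are placeholders):

        # MySQL's own bulk loader, issued through DBI -- not portable to the
        # other databases mentioned above.
        $dbh->do(q{LOAD DATA LOCAL INFILE 'logdata.csv'
                   INTO TABLE log_table
                   FIELDS TERMINATED BY ','});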

    Make sure you have AutoCommit turned off (but remember to issue a commit when you're done!). This can significantly speed things up :-)

      Be mindful of maintaining integrity in your data. If you decide to drop the indices for importing (which does increase the speed), then the responsibility for integrity falls to you, the importer.

      Turning AutoCommit off is a great idea, and it also speeds things up. However, unless you are certain your data is clean, you may well want to keep AutoCommit off but commit on every valid transaction and roll back on invalid ones, like so:

      eval { $sth->execute("BLAH") };
      if ($@) { $dbh->rollback(); }
      else    { $dbh->commit();   }
      But with a bulk insert I don't know whether that is so much of an option... In other words, you have to make sure the data is clean at some point in the process; your import method determines when that cleaning takes place. With execute_array you can at least find out which rows failed; something like the sketch below.
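
      For example, using DBI's ArrayTupleStatus attribute (an untested sketch; @foo_values and @bar_values are placeholders for your bound columns):

          my @tuple_status;
          eval {
              $sth->execute_array({ ArrayTupleStatus => \@tuple_status },
                                  \@foo_values, \@bar_values);
          };
          if ($@) {
              # DBI puts an arrayref of [ err, errstr, state ] in the slot of
              # each tuple that failed.
              for my $i (0 .. $#tuple_status) {
                  next unless ref $tuple_status[$i];
                  warn "row $i failed: $tuple_status[$i][1]\n";
              }
              $dbh->rollback();
          }
          else {
              $dbh->commit();
          }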

      MS-SQL does have a BULK INSERT command, as far as I'm aware; check your docs, they are actually pretty good.

      Best of luck on your problem,
Re: Parse Logfile and Bulk Insert
by mpeppler (Vicar) on Jul 26, 2004 at 15:07 UTC
    Real bulk-loading is faster - for example with Sybase the bulk-load API will create the binary image of the row on the client and send this image to the server. This means that the server only has to allocate the space for the new row and write the row to the disk, but doesn't have to perform any other work (such as converting between data types, maintaining indexes, calling triggers, etc.)

    The downside is that there is no consistent API for bulk-load calls, so portability is almost impossible to achieve.
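
    For Sybase, the bulk-copy API is wrapped by the Sybase::BCP module; the usage is roughly the following (a sketch only; the option names and file/table names here are illustrative, see the module's POD):

        use Sybase::BCP;

        # Bulk-copy a delimited file straight into a table via the BCP API.
        my $bcp = Sybase::BCP->new($user, $password, $server);
        $bcp->config(INPUT     => 'logdata.csv',        # illustrative names --
                     OUTPUT    => 'mydb..log_table',    # see the Sybase::BCP POD
                     SEPARATOR => ',');
        $bcp->run;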

    Michael

Re: Parse Logfile and Bulk Insert
by graff (Chancellor) on Jul 27, 2004 at 04:34 UTC
    or do a bulk insert (it should support all the databases)

    Whew. As mentioned previously, the servers tend to differ in the details of how their bulk loaders work. But if you're currently supporting just three "brands" of servers, the mechanics are manageable, and with a sensible design involving config files, and maybe brand-specific modules as well, it should be easily expandable. If you're supporting the same basic table schema across the different brands, it should be even easier.

    One thing you should be able to count on is that all of them will be able to use the same basic CSV format for the actual rows of data, and you'll be fine with that so long as you're working with "basic" (lowest-common-denominator) data types (shouldn't be a problem, if the input comes from logs).

    The differences involve how you convey field names and other control parameters to the bulk loader, the path/name of the bulk-loader program, what sort of command-line usage it supports, how it handles error conditions, logging, etc. Read each of the bulk loader manuals and practice, practice, practice. I think the payoff will be worth the effort -- I've only had experience with Oracle in this regard, but its native bulk loader really is a lot faster than straight inserts with DBI, no matter how you tweak it.
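
    As a rough sketch of the config-file / dispatch idea (read_loader_config() and all the file names here are made up for illustration):

        # %loader_cmd would come from a per-brand config file and hold command
        # templates like the ones in Corion's reply above, with %s standing in
        # for the CSV file written out by the parser.
        # read_loader_config() is a made-up helper for this sketch.
        my %loader_cmd = read_loader_config('loaders.conf');

        sub bulk_load {
            my ($brand, $csv_file) = @_;
            die "no loader configured for $brand\n" unless $loader_cmd{$brand};
            my $cmd = sprintf $loader_cmd{$brand}, $csv_file;
            system($cmd) == 0
                or die "bulk load via $brand failed: exit status " . ($? >> 8) . "\n";
        }

        bulk_load('postgres', '/tmp/parsed_log.csv');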

Re: Parse Logfile and Bulk Insert
by mkirank (Chaplain) on Jul 27, 2004 at 09:03 UTC
    I am considering writing a C program to open the files, parse them, and insert the data into staging tables (I think this might be faster). From there I will write a Perl script which does a "select into table from table" and filters out the data; something like the sketch below. What do you think of this option?
    Thanks for the inputs ;-)
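
    The filtering step itself could be a single statement run inside the database, for example (a sketch; clean_logs, staging_logs and the WHERE clause are placeholders, and INSERT ... SELECT is the portable way to copy between existing tables):

        # Move only the rows that pass the filter from the staging table
        # into the real table, entirely on the server side.
        $dbh->do(q{
            INSERT INTO clean_logs (foo, bar)
            SELECT foo, bar
            FROM   staging_logs
            WHERE  bar IS NOT NULL
        });
        $dbh->commit unless $dbh->{AutoCommit};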