PerlMonks  

Re: Best method to load a csv file.

by trammell (Priest)
on Dec 17, 2004 at 14:59 UTC [id://415647]


in reply to Best method to load a csv file.

Your database may have a load function or utility that lets you load CSV data, e.g. MySQL has LOAD DATA ... and mysqlimport. Either would likely be much faster than any Perl/DBI solution.
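A minimal sketch of the MySQL route, issued through DBI so it still lives in one Perl script (DSN, credentials, table, and file names here are hypothetical; `mysql_local_infile=1` is the DBD::mysql option that enables client-side LOAD DATA LOCAL):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical connection details -- adjust for your environment.
my $dbh = DBI->connect(
    'dbi:mysql:database=testdb;mysql_local_infile=1',
    'user', 'pass', { RaiseError => 1 });

# LOAD DATA runs inside the server, so it is far faster than
# issuing one INSERT per row through DBI.
$dbh->do(q{
    LOAD DATA LOCAL INFILE 'data.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
});
$dbh->disconnect;
```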

Replies are listed 'Best First'.
Re^2: Best method to load a csv file.
by samgold (Scribe) on Dec 17, 2004 at 15:12 UTC
    The problem with that is I have to manipulate the data before I can load it. I could manipulate the data, write it to a new file, and then load it using SQL*Loader, but I would rather have just one script to load the data. The amount of data that I am loading on a regular basis is anywhere from 300,000 to 400,000 records, spread out among 5 or 6 different files, each with a different format. The largest file, about 200k records, only takes a few hours to run. I am not too concerned about how long it takes; I just want to know if there is a better way to load the data using Perl/DBI.

    Thanks,
    Sam
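If you do stay with pure Perl/DBI, the usual speedups are to prepare the INSERT once, use placeholders, turn off AutoCommit, and commit in batches. A sketch, assuming an Oracle DSN and a hypothetical table `mytable` with three columns:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# AutoCommit off + batched commits avoids a transaction per row.
my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare(
    'INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?)');

my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, '<', 'data.csv' or die "data.csv: $!";
my $n = 0;
while (my $row = $csv->getline($fh)) {
    # ... manipulate the fields here as needed ...
    $sth->execute(@$row);
    $dbh->commit if ++$n % 10_000 == 0;   # commit every 10k rows
}
close $fh;
$dbh->commit;
$dbh->disconnect;
```

Preparing once and batching commits routinely cuts DBI load times by an order of magnitude, though the bulk loaders discussed below will still beat it.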
      The database loading portion of the task is often resource-intensive enough that you will want to use your database's import facility. Just process your data in your script prior to giving it to the database; that's the way most of us do it anyway. You should also be able to reduce your runtime from a few hours per file to a few minutes for everything if you do this.

        Plus, you can always call your database import utility from the same Perl program that modifies your data, once the processing is complete. With that number of records, I'd still write it out to a text file first, though, before calling the external DB load utility.
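That pattern might look like the following sketch: write the transformed records to a flat file, then shell out to SQL*Loader from the same script (the file name, control file, and credentials are all hypothetical, and `@records` stands in for your manipulated rows):

```perl
# Write the transformed records to a flat file for the loader.
open my $out, '>', 'load_me.dat' or die "load_me.dat: $!";
for my $rec (@records) {        # @records = your manipulated rows
    print $out join(',', @$rec), "\n";
}
close $out or die "close: $!";

# Invoke SQL*Loader; the list form of system() avoids the shell.
my $rc = system('sqlldr', 'user/pass@mydb',
                'control=mytable.ctl', 'log=mytable.log');
die "sqlldr exited with status $?" if $rc != 0;
```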

        Amateurs discuss tactics. Professionals discuss logistics. And... my cat's breath smells like cat food.
      I'd suggest a shell script that would fire off a SQL*Loader process for each of the different file layouts.

      From the SQL*Loader documentation:

      To load data from multiple datafiles in one SQL*Loader run, use an INFILE clause for each datafile. Datafiles need not have the same file processing options, although the layout of the records must be identical.

      Write a control file for each of the 5 or 6 different data file layouts then make sure to reference that in your calling statement.
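As a sketch, a control file for one of the layouts might look like this (file, table, and column names are hypothetical); you would then point the calling statement at it with something like `sqlldr user/pass@mydb control=layout1.ctl`:

```
-- layout1.ctl: one control file per file layout
LOAD DATA
INFILE 'file1.dat'
INFILE 'file2.dat'
APPEND
INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
(col1, col2, col3)
```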
      My experience with MS-Access leads me to believe that generating a new CSV file in the format you want, and importing that, is up to several orders of magnitude faster than using DBI to insert every item into the database one by one: a few seconds to generate the text files, as opposed to many minutes with DBI, for tables of a few tens of thousands of records.

      So, as a general approach, I'd advise you to generate the flat files with Perl, and use the database's loader/import mechanism to actually read the data into the database.

      Note that reading all the data out of the database using Perl+DBI isn't nearly as slow as inserting/updating the records.

      Update: I notice that jdtoronto, based on his own independent experience, came to the same conclusion (point 3).
