http://qs1969.pair.com?node_id=415644

samgold has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,

I am loading a lot of CSV files into our database. My question is: what is the best way to do it? Currently I am using Text::ParseWords to parse the CSV files and DBI for the database. I have never used DBD::CSV or Text::CSV and am not sure which I should use. If there are any other options, please let me know.

Thanks,
Sam

Update: What about Text::CSV::Simple?

Replies are listed 'Best First'.
Re: Best method to load a csv file.
by trammell (Priest) on Dec 17, 2004 at 14:59 UTC
    Your database may have a load function or utility that lets you load CSV data, e.g. MySQL has LOAD DATA ... and mysqlimport. Either would likely be much faster than any Perl/DBI solution.
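
    For instance, you can still drive MySQL's loader from Perl through DBI -- a rough sketch, where the DSN, credentials, table and file names are made up and both server and client must permit LOCAL INFILE:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # mysql_local_infile=1 lets DBD::mysql send a local file to the server
        my $dbh = DBI->connect(
            'dbi:mysql:database=mydb;mysql_local_infile=1',
            'user', 'password',
            { RaiseError => 1 },
        );

        # One statement replaces hundreds of thousands of row-by-row INSERTs
        $dbh->do(q{
            LOAD DATA LOCAL INFILE 'import.csv'
            INTO TABLE my_table
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
            LINES TERMINATED BY '\n'
            IGNORE 1 LINES
        });

        $dbh->disconnect;
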
      The problem with that is that I have to manipulate the data before I can load it. I could manipulate the data, write it to a new file, and then load it using SQL*Loader, but I would rather have just one script to load the data. The amount of data I load on a regular basis is anywhere from 300,000 to 400,000 records, spread out among 5 or 6 different files, each with a different format. The largest file, about 200k records, only takes a few hours to run. I am not too concerned about how long it takes; I just want to know if there is a better way to load the data using Perl/DBI.

      Thanks,
      Sam
        The database loading portion of the task is often resource-intensive enough that you will want to use your database's import facility. Just process your data in your script before handing it to the database; that is how most of us do it anyway. You should also be able to reduce your runtime from a few hours per file to a few minutes for everything if you do this.
        I'd suggest a shell script that would fire off a SQL*Loader process for each of the different file layouts.

        From documentation on SQL*Loader:

        To load data from multiple datafiles in one SQL*Loader run, use an INFILE clause for each datafile. Datafiles need not have the same file processing options, although the layout of the records must be identical.

        Write a control file for each of the 5 or 6 different data file layouts then make sure to reference that in your calling statement.
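
        The same idea works from Perl rather than a shell script -- a rough sketch, where the control file names and the connect string are placeholders:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # One SQL*Loader run per file layout, each with its own control file
            my @control_files = qw(orders.ctl customers.ctl invoices.ctl);

            for my $ctl (@control_files) {
                my $rc = system('sqlldr', 'userid=scott/tiger@mydb',
                                "control=$ctl", "log=$ctl.log");
                die "sqlldr failed for $ctl (exit " . ($rc >> 8) . ")\n" if $rc;
            }
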
        My experience with MS-Access leads me to believe that generating a new CSV file in the format you want, and then importing it, is up to several orders of magnitude faster than using DBI to insert every record into the database one by one: a few seconds to generate the text files, as opposed to many minutes with DBI, for tables of a few tens of thousands of records.

        So, as a generic approach, I'd advise you to generate the flat files with Perl, and use the database's loader/import mechanism to actually read the data into the database.

        Note that reading all the data out of the database using Perl+DBI isn't nearly as slow as inserting/updating the records.

        Update: I notice that jdtoronto, based on his own independent experience, came to the same conclusion (point 3).

Re: Best method to load a csv file.
by jdtoronto (Prior) on Dec 17, 2004 at 18:29 UTC
    Much of my time is spent doing just this. Several comments:
    • Text::CSV_XS has been the most reliable I have found. It is flexible and so far I haven't found a 'CSV' file I cannot handle. Text::CSV::Simple is too simple and won't write CSV format.
    • When it comes to processing the data: in some cases I import the CSV file into a temp table, using the LOAD functions of MySQL, and then process the data in the table. This is useful in Tk-based apps where I can use Tk::DBI::Table to preview the data for users, especially when they are doing something like mapping input data fields to our database structure.
    • Where I have bulk data, or something is done routinely in a known structure, I do as has been suggested earlier: I read the CSV file, process it, spit it out as a file again, then use LOAD from within the Perl script to have MySQL import it (sketched below). The speed advantage is amazing! I have one job which imports around 100,000 records per day; reading the file and inserting the records with DBI took 7-8 minutes, whereas pre-processing now takes about 35 seconds and the MySQL load averages 170 ms!
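
    A rough sketch of that read-process-write-LOAD pipeline -- the file names, the transform and the table are all made up:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::CSV_XS;
        use DBI;

        my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

        open my $in,  '<', 'raw.csv'   or die "raw.csv: $!";
        open my $out, '>', 'clean.csv' or die "clean.csv: $!";

        while (my $row = $csv->getline($in)) {
            # whatever massaging the data needs, e.g. turn 17/12/2004 into 2004-12-17
            $row->[2] = join '-', reverse split m{/}, $row->[2];
            $csv->print($out, $row);
        }
        close $out;

        # Now let MySQL slurp the cleaned file in one go
        my $dbh = DBI->connect('dbi:mysql:database=mydb;mysql_local_infile=1',
                               'user', 'password', { RaiseError => 1 });
        $dbh->do(q{
            LOAD DATA LOCAL INFILE 'clean.csv'
            INTO TABLE my_table
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        });
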
    Good luck! I hope this helps.

    jdtoronto

      jdtoronto wrote:
      Text::CSV_XS has been the most reliable I have found. It is flexible and so far I haven't found a 'CSV' file I cannot handle.
      Well, I certainly have, though I don't remember now what the difficulties were in detail (I think it was something like the "csv" file had spaces after the commas -- the trouble with CSV is that there is no real standard for it, only de facto standards).

      I've heard that DBD::AnyData with trim=>1 can deal with spaces after commas, but I haven't tried it myself.

      With Text::CSV_XS, you almost certainly want to use the "binary" option. Otherwise you'll have problems with values that have extended characters or embedded newlines.
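
      For example (the sample record is made up):

          use Text::CSV_XS;

          # binary => 1 allows embedded newlines and 8-bit characters in fields
          my $csv = Text::CSV_XS->new({ binary => 1 });

          $csv->parse(qq{1,"two\nlines","caf\xe9"})
              or die "parse failed: ", $csv->error_input;
          my @fields = $csv->fields;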

      DBD::CSV uses Text::CSV_XS internally, so if Text::CSV_XS (with binary on) is no good, don't expect DBD::CSV to do any better. If I remember right, there's something a little screwy about the way DBD::CSV converts the header row into database column names (e.g. you may have trouble if there are spaces in your column descriptions). Either fix up the first row of your CSV file manually, or look for a way to tell it the column names yourself so they override the header row (as I remember it, there *is* a way, though I don't see it in the man page at the moment).
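
      If memory serves, it's the col_names attribute of csv_tables -- a sketch with made-up file, table and column names (check the attribute names against your version of DBD::CSV):

          use DBI;

          my $dbh = DBI->connect('dbi:CSV:', undef, undef, { RaiseError => 1 });

          $dbh->{csv_tables}{people} = {
              file           => 'people.csv',
              col_names      => [qw(id first_name last_name phone)],  # override the header row
              skip_first_row => 1,                                    # don't read the header as data
          };

          my $sth = $dbh->prepare('SELECT id, last_name FROM people');
          $sth->execute;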

      You should take a look at this: dbi_dealing_with_csv

Re: Best method to load a csv file.
by jcoxen (Deacon) on Dec 17, 2004 at 16:53 UTC
    Re: Text::CSV::Simple - it does just what its name implies: it makes loading CSV files simple. The thing is, you still have to manipulate the data once you've loaded it, and it doesn't help you write the data out to your database.

    I'm using that module right now on a project I'm working on to load a bunch of CSV files. For some of them I only load certain fields, for some I load everything, and for some I load everything but then throw away certain fields or load a pair of fields into a hash. Text::CSV::Simple is a great help for all of this... but only on the input side.
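
    For instance -- the file name and field layout below are made up:

        use Text::CSV::Simple;

        # Pull only the columns you care about
        my $parser = Text::CSV::Simple->new;
        $parser->want_fields(0, 1, 3);                  # positions of the fields to keep
        my @rows = $parser->read_file('input.csv');     # list of array refs

        # Or map each row into a hash keyed by field name
        my $mapper = Text::CSV::Simple->new;
        $mapper->field_map(qw/id name email/);
        my @records = $mapper->read_file('input.csv');  # list of hash refs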

    Import the CSV files, manipulate them as necessary and then take trammell's suggestion and use the data import function on your database program.

    Jack

    Cogito cogito, ergo cogito sum
    (I think I think, therefore I think I am)
Re: Best method to load a csv file.
by jZed (Prior) on Dec 17, 2004 at 18:30 UTC
    If at all possible, use a loader for your database. If not, and you use DBD::CSV, you can simplify your code by using SQL to both access the CSV and load the data. If you want the fastest and most flexible option, use Text::CSV_XS, *not* Text::CSV.
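
    A sketch of the DBD::CSV route, with made-up DSNs, directory and table names (here the 'sales' table is assumed to resolve to a file in f_dir; map it explicitly via csv_tables if not):

        use DBI;

        my $csv_dbh = DBI->connect('dbi:CSV:f_dir=/data/incoming', undef, undef,
                                   { RaiseError => 1 });
        my $db_dbh  = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                                   { RaiseError => 1, AutoCommit => 0 });

        my $sel = $csv_dbh->prepare('SELECT id, name, amount FROM sales');
        my $ins = $db_dbh->prepare('INSERT INTO sales (id, name, amount) VALUES (?, ?, ?)');

        $sel->execute;
        while (my @row = $sel->fetchrow_array) {
            $ins->execute(@row);    # any per-row manipulation would go here
        }
        $db_dbh->commit;
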
Re: Best method to load a csv file.
by CountZero (Bishop) on Dec 17, 2004 at 23:56 UTC
    I had a similar problem: differently formatted CSV-files to be unified and entered into one DB.

    The best approach (for me) was to write a script for each type of CSV file (using Text::CSV_XS), dump the data from these CSV files into a temporary table (either through a loader or with DBI), and then finally transfer the data from the temporary table to the main DB (again with DBI).
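
    The final transfer step might look something like this (table and column names are made up):

        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=mydb', 'user', 'password',
                               { RaiseError => 1 });

        # Move the cleaned-up rows from the temporary table into the main table
        # entirely inside the database, then empty the temporary table
        $dbh->do(q{
            INSERT INTO main_table (id, name, amount)
            SELECT id, name, amount FROM temp_table
        });
        $dbh->do('TRUNCATE TABLE temp_table');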

    A dedicated loader is indeed faster than DBI, but if you take into account the time to write out the "unified" CSV-file as well, I'm not so sure that you win much time in the end.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Best method to load a csv file.
by samgold (Scribe) on Dec 17, 2004 at 19:33 UTC
    Thank you all for your suggestions and advice.

    Thanks,
    Sam