in reply to Implementing a buffered read-and-insert algorithm

Often threading really helps an application's speed and responsiveness, but I don't think it will speed up your application here. The reason is that both your CSV files and your database are on the same disk drive, but at different disk locations, so alternating between threads will only cause disk thrashing as the disk head continually seeks back and forth between the CSV and database locations. On the other hand, reading sequential rows of a CSV file from disk is a fast operation relative to seeking, so that is the better algorithm here.

Along the same lines, read as much of the CSV file into memory as you can on each pass. That maximizes sequential reads versus seeks and should speed up disk I/O on the CSV side. You may also want to see if your chosen database has a "loader" program that is purpose-built to quickly import large amounts of data into a database.
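
For example, something along these lines might work (an untested sketch; the DSN, table name, columns, and file name are just placeholders, and it assumes Text::CSV_XS and DBI):

    #!/usr/bin/perl
    # Untested sketch: read the CSV in big sequential chunks, then insert
    # each chunk in one transaction. Table, columns, and DSN are made up.
    use strict;
    use warnings;
    use Text::CSV_XS;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 0 });
    my $sth = $dbh->prepare('INSERT INTO imported (a, b, c) VALUES (?, ?, ?)');
    my $csv = Text::CSV_XS->new({ binary => 1 });

    open my $fh, '<', 'data.csv' or die "Can't read data.csv: $!";

    my $chunk_size = 50_000;    # rows to hold in memory per pass
    while (1) {
        # Pull a long run of rows off the disk sequentially...
        my @rows;
        while (my $row = $csv->getline($fh)) {
            push @rows, $row;
            last if @rows >= $chunk_size;
        }
        last unless @rows;

        # ...then do all the inserts for that chunk before reading again.
        $sth->execute(@$_) for @rows;
        $dbh->commit;
    }

    close $fh;
    $dbh->disconnect;

On a single spindle that keeps the read head working through the CSV for long stretches instead of bouncing back and forth on every row.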

-Mark

Re^2: Implementing a buffered read-and-insert algorithm
by radiantmatrix (Parson) on Dec 13, 2004 at 21:47 UTC

    > The reason is that both your CSV files and your database are on the same disk drive, but at different disk locations

    Nope. The CSV files and the OS are on one volume, and the database is on a separate RAID volume set.

    > You may also want to see if your chosen database has a "loader" program that is purpose-built to quickly import large amounts of data into a database.

    That would seem to make the most sense on its face. However, this tool will be useful for a number of DBs, and there is a good chance that there will be pattern matching later as a sort of "pre-filter", and the occasional manipulation of data before insertion.

    radiantmatrix
    require General::Disclaimer;
    s//2fde04abe76c036c9074586c1/; while(m/(.)/g){print substr(' ,JPacehklnorstu',hex($1),1)}

      this tool will be useful for a number of DBs, and there is a good chance that there will be pattern matching later as a sort of "pre-filter", and the occasional manipulation of data before insertion.

      Even so, all the major DB servers have an "import-from-data-file" tool or function, which is normally quite easy to use and is highly optimized for loading data into that server. It will beat the pants off doing inserts with Perl/DBI on really large data sets, even when you throw in the time it takes to read the man page for the DB's import tool (which shouldn't take more than 15 minutes).
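
      To give one concrete example, MySQL can do the whole load server-side with LOAD DATA INFILE (PostgreSQL's COPY and Oracle's SQL*Loader play the same role); here is a rough, untested sketch where the DSN, file, and table are placeholders:

          #!/usr/bin/perl
          # Sketch only: hand the bulk load to the server's native loader.
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:mysql:database=test;mysql_local_infile=1',
                                 'user', 'pass', { RaiseError => 1 });

          # The server parses the CSV and does the inserts itself.
          $dbh->do(q{
              LOAD DATA LOCAL INFILE '/tmp/data.csv'
              INTO TABLE imported
              FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
              LINES TERMINATED BY '\n'
          });

          $dbh->disconnect;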

      As for any pre-filtering or editing of the data, setting up a Perl script to do just this step will generally be a better solution (cleaner, quicker, easier to debug/validate/update) than having this kind of logic in the same app as the relatively complicated I/O methods you're looking at (along with the DBI stuff too). In other words, given a huge CSV stream that isn't exactly what you'd want to feed into the DB, write a script that just creates the exact CSV stream that you do want to import. Then use the DB server's native import tool to load the filtered/edited data file.
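
      A pre-filter like that can stay very small; here is a sketch (the skip pattern and the whitespace clean-up are invented examples of the kind of thing you might do):

          #!/usr/bin/perl
          # csv-prefilter.pl -- raw CSV on stdin, cleaned CSV on stdout.
          # The filtering rules below are placeholders for your own logic.
          use strict;
          use warnings;
          use Text::CSV_XS;

          my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

          while (my $row = $csv->getline(\*STDIN)) {
              next if $row->[0] =~ /^TEST-/;       # drop rows we don't want
              s/^\s+|\s+$//g for @$row;            # trim stray whitespace
              $csv->print(\*STDOUT, $row);
          }

      Then it's just csv-prefilter.pl < raw.csv > clean.csv, and you point the server's import tool at clean.csv.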

      If you've got plenty of time and CPU cycles to burn, go ahead and experiment with approximating what DB-specific loaders already do, but if you're in a hurry and/or have a lot of different instances (data sets and/or DB servers) to worry about, keep it simple by using the existing tools for the job, and only invent the ones that don't exist yet.

      update: In case it helps, bear in mind that it would be both easy and efficient to develop a perl wrapper for each of the DB-native import tools. This might not seem worthwhile until you have at least two different DB servers to work with, but once you see what the similarities and differences are in the DB loader tools, I think you could get a lot of mileage and satisfaction out of "objectifying" and "modularizing" that aspect of the problem -- Perl could provide a handy layer of abstraction to hide the differences among DB loaders, and your development time (and run time) will be much better spent in that direction.
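
      To sketch what such a wrapper might look like (everything here -- the package name, the constructor arguments, the psql command line -- is invented for illustration, not an existing module):

          # Hypothetical wrapper: one small class per back-end loader.
          package CSVLoader::Pg;
          use strict;
          use warnings;

          sub new { my ($class, %arg) = @_; bless {%arg}, $class }

          sub load {
              my ($self, $file, $table) = @_;
              # Hand the file to PostgreSQL's native bulk loader via psql.
              system('psql', '-d', $self->{dbname},
                     '-c', qq{\\copy $table FROM '$file' WITH CSV}) == 0
                  or die "psql bulk load failed: $?";
          }

          package main;

          # Calling code never cares which server sits behind the wrapper;
          # a CSVLoader::Mysql built around LOAD DATA INFILE would slot in
          # the same way.
          my $loader = CSVLoader::Pg->new(dbname => 'test');
          $loader->load('/tmp/clean.csv', 'imported');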