awohld has asked for the wisdom of the Perl Monks concerning the following question:

I'm running a bunch of database queries using DBD::CSV and it runs pretty slowly. Does anyone have suggestions for making it faster? Here's the DB part of my code:
my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;") or die "Cannot connect: " . $DBI::errstr; $dbh->{'csv_tables'}->{'bt'} = { 'file' => './X_Con.csv'}; $dbh->{'csv_tables'}->{'sec'} = { 'file' => './Y_Con.csv'}; $dbh->{'csv_tables'}->{'stats'} = { 'file' => './Z_Con.csv'}; my $sth_sec = $dbh->prepare("SELECT KEY, TEST, INSTANCE, ID, HID FROM +sec") or die "Can't prepare SQL statement: $DBI::errstr\n"; my $sth_bt = $dbh->prepare("SELECT NAME, STATE FROM bt WHERE ID =? AND + HID=?") or die "Can't prepare SQL statement: $DBI::errstr\n"; my $sth_stats = $dbh->prepare("SELECT CCS, kbps, MBytes FROM stats WHE +RE ID=? AND HID=? & INSTANCE=?") or die "Can't prepare SQL statement: $DBI::errstr\n"; $sth_sec->execute() or die "Can't execute SQL statement: $DBI::errstr\n"; open(TRAFFICSTATS,'>',"NoThroughput.csv") || die("Cannot Open File"); { my $old_sel = select(TRAFFICSTATS); $| = 1; # Auto-flush. select($old_sel); } print TRAFFICSTATS "KEY,ID,HID,NAME,STATE,CCS,kbps,MBytes\n"; while ( my @row = $sth_sec->fetchrow_array ) { my ( $keyId, $testCarrier, $instanceId, $id, $hId ) = @row; $sth_bt->execute( $id, $hId ); @row = $sth_bt->fetchrow_array; my ( $name, $state ) = @row; $sth_stats->execute($id, $hId, $instanceId ); @row = $sth_stats->fetchrow_array; my ( $CCS, $kbps, $mBytes ) = @row; print TRAFFICSTATS "$keyId,$id,$hId,$name,$state,$CCS,$kbps,$mBytes +\n"; } close TRAFFICSTATS;

Re: How to make my DBD::CSV DB code faster.
by CountZero (Bishop) on Sep 29, 2005 at 21:51 UTC
    DBD::CSV is by its very nature slow: it has to read and write a lot to simple text files and uses SQL::Statement as its SQL engine, so there is little room for improvement unless you feed the CSV files into a real database. SQLite or MySQL spring to mind in that respect.
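
    For illustration, a minimal sketch of loading one of the OP's CSV files into SQLite with DBD::SQLite and Text::CSV. The file name comes from the question; the column layout is read from the CSV header row, since I don't know the actual columns:

    use DBI;
    use Text::CSV;

    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
    open my $fh, '<', './X_Con.csv' or die "Cannot open X_Con.csv: $!";
    my @cols = @{ $csv->getline($fh) };    # column names from the header row

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=traffic.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do( 'CREATE TABLE bt (' . join( ', ', map { "$_ TEXT" } @cols ) . ')' );

    my $ins = $dbh->prepare(
        'INSERT INTO bt VALUES (' . join( ',', ('?') x @cols ) . ')' );
    while ( my $row = $csv->getline($fh) ) {
        $ins->execute(@$row);
    }
    $dbh->commit;    # one transaction makes the bulk insert fast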

    CountZero

    PS: Merlyn will probably say that MySQL is not a real database and that you should use PostgreSQL.

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: How to make my DBD::CSV DB code faster.
by graff (Chancellor) on Sep 29, 2005 at 22:29 UTC
    If performance is an issue, I'd tend to agree with the first reply: maybe it's time to start looking at a real database engine.

    If you don't think you're ready for that (... aw c'mon! why not? well, anyway...) then you should consider changing from this approach:

    foreach row (id,hid,instance,etc...) in table1
        lookup row in table2 that contains id,hid
        lookup a row in table3 that contains id,hid,instance
    to something like this:
    load all of table2 into a hash, keyed by "id,hid"
    load all of table3 into a hash, keyed by "id,hid,instance"
    foreach row (id,hid,instance,etc...) in table1
        do something with row data, $table2{"id,hid"} and $table3{"id,hid,instance"}
    Since you're using DBD::CSV instead of a real database server, I'm assuming that the amounts of data in table2 and table3 are small enough to fit comfortably in memory. If that's not true, then you really should think seriously about MySQL or Postgres.
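
    A minimal sketch of that second approach in Perl, reusing the OP's $dbh, $sth_sec, and TRAFFICSTATS handles. The column lists follow the OP's SELECT statements; the empty-string fallbacks for missing rows are my assumption:

    # Load bt once: "id,hid" => [ NAME, STATE ]
    my %bt;
    for my $r ( @{ $dbh->selectall_arrayref('SELECT ID, HID, NAME, STATE FROM bt') } ) {
        my ( $id, $hid, @rest ) = @$r;
        $bt{"$id,$hid"} = \@rest;
    }

    # Load stats once: "id,hid,instance" => [ CCS, kbps, MBytes ]
    my %stats;
    for my $r ( @{ $dbh->selectall_arrayref('SELECT ID, HID, INSTANCE, CCS, kbps, MBytes FROM stats') } ) {
        my ( $id, $hid, $inst, @rest ) = @$r;
        $stats{"$id,$hid,$inst"} = \@rest;
    }

    # Single pass over sec, joining against the in-memory hashes.
    $sth_sec->execute or die "Can't execute SQL statement: $DBI::errstr\n";
    while ( my ( $keyId, $test, $inst, $id, $hId ) = $sth_sec->fetchrow_array ) {
        my @bt_row   = @{ $bt{"$id,$hId"}          || [ ('') x 2 ] };
        my @stat_row = @{ $stats{"$id,$hId,$inst"} || [ ('') x 3 ] };
        print TRAFFICSTATS join( ',', $keyId, $id, $hId, @bt_row, @stat_row ), "\n";
    }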
      Your second approach is pretty much how SQL::Statement handles joins: it loads each table only once and then searches in-memory hashes.
Re: How to make my DBD::CSV DB code faster.
by jZed (Prior) on Sep 30, 2005 at 02:36 UTC
    I agree with the other posters that for large datasets, and in cases where speed is critical, you may be better off with one of the "real" databases, but to say it's DBD::CSV and therefore "there is little place for improvements" is much too broad a statement. Yes, DBD::CSV can be slow, but that doesn't mean there is never a way to optimize. For example, I suspect your code will run much faster if you replace all of your prepares, executes, loops, and file opening and writing with this one statement. It will (if I've understood what you're doing) create a new table containing the join of the rows from the other three tables:
    $dbh->do(" CREATE TABLE trafficstats AS SELECT key, sec.id, sec.hid, name,state,ccs,kbps,mbytes FROM sec,bt,stats WHERE sec.id = bt.id AND sec.hid = bt.hid AND sec.id = stats.id AND sec.hid = stats.hid AND sec.instance = stats.instance ");
    This does only three physical file scans (one for each table), whereas your loop does twice as many physical file scans as there are rows in the sec table (it has to search each of the other two tables once for each row in sec). Please let me know if it improves your time and accomplishes what you want.
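
    If you want the result to land in the OP's output file, I believe (an assumption on my part, worth checking against the DBD::CSV docs) that mapping the new table name to that file before running the statement will do it:

    # Assumption: this mapping makes DBD::CSV create the trafficstats
    # table directly in ./NoThroughput.csv rather than in a file named
    # after the table.
    $dbh->{csv_tables}{trafficstats} = { file => './NoThroughput.csv' };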
Re: How to make my DBD::CSV DB code faster.
by davidrw (Prior) on Sep 30, 2005 at 02:21 UTC
    You can also use DBD::AnyData instead of DBD::CSV and take advantage of DBD::AnyData's in-memory tables:
    use DBI;

    # Old DBD::CSV setup, for comparison:
    #my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;")
    #    or die "Cannot connect: " . $DBI::errstr;
    #$dbh->{'csv_tables'}->{'bt'}    = { 'file' => './X_Con.csv' };
    #$dbh->{'csv_tables'}->{'sec'}   = { 'file' => './Y_Con.csv' };
    #$dbh->{'csv_tables'}->{'stats'} = { 'file' => './Z_Con.csv' };

    my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');
    $dbh->func( 'bt',    'CSV', './X_Con.csv', 'ad_import' );
    $dbh->func( 'sec',   'CSV', './Y_Con.csv', 'ad_import' );
    $dbh->func( 'stats', 'CSV', './Z_Con.csv', 'ad_import' );
    And the rest of your code should work unchanged, and will be much faster because it's reading from RAM instead of disk.
    Note (see the docs) that any changes to those tables have to be explicitly written back to disk if you want them saved.
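
    Per the DBD::AnyData docs, that write-back uses ad_export; for example:

    # Write the in-memory 'stats' table back to its CSV file.
    $dbh->func( 'stats', 'CSV', './Z_Con.csv', 'ad_export' );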

    For even better performance (though the above will probably be sufficient for you), see graff's hash-it-up solution.
      The in-memory tables are now a feature of all SQL::Statement DBDs, so with DBD::CSV you can CREATE TEMP TABLE foo ... to create an in-memory table; you don't need to switch to DBD::AnyData for that.
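
      A minimal sketch of that, assuming CREATE TEMP TABLE accepts the same AS SELECT form shown in the reply above (the mem_stats table name is illustrative):

      # Copy the on-disk stats table into an in-memory temp table once,
      # then point the repeated lookups at the in-memory copy.
      $dbh->do('CREATE TEMP TABLE mem_stats AS SELECT * FROM stats');
      my $sth_stats = $dbh->prepare(
          'SELECT CCS, kbps, MBytes FROM mem_stats WHERE ID=? AND HID=? AND INSTANCE=?');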
Re: How to make my DBD::CSV DB code faster.
by dragonchild (Archbishop) on Sep 30, 2005 at 01:57 UTC
    A third option is to help improve the code for SQL::Statement and the other underlying technologies. That way, you improve the speed of your code and you give back to the community that has helped you solve your problem so quickly and easily.

    I like that option. :-)


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      I like that option too; patches are welcome :-). In this particular case, though, I suspect the OP's loop approach is a significant portion of the problem.