SRick has asked for the wisdom of the Perl Monks concerning the following question:

I need to create a Perl script which reads in log files (.csv compressed as .gz) and then inserts them into the appropriate table in a SQL Server database, based on the filename. I am fairly new to Perl, but I have come up with a solution that works, albeit incredibly slowly.

These log files range from 70,000 to 150,000 records per file, and we get 12-16 logs per day. It took over 30 minutes for the script to complete on one file of ~75,000 records. Can someone look at my code and see if there is anything that could be changed to improve performance?

use strict;
use warnings;
use DBI;
use Compress::Zlib;
use File::Find;

# DBD::ODBC
my $dsn      = 'DBI:ODBC:Driver={SQL Server}';
my $host     = 'host ip, port';
my $database = 'ProjectTest';
my $user     = 'user';
my $auth     = 'password';

# Connect via DBD::ODBC by specifying the DSN dynamically.
my $dbh = DBI->connect("$dsn;Server=$host;Database=$database", $user, $auth,
                       { RaiseError => 1, AutoCommit => 1 })
    || die "Database connection not made: $DBI::errstr";

# Find files in specified directory
my $dir = 'C:\Users\user\Desktop\Timing Logs';
find(\&print_name, $dir);

# Read found files into database
sub print_name {
    if (-f) {
        # Read csv file line by line
        my $file = $_;
        my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno\n";
        while ($gz->gzreadline($_) > 0) {
            my $line = $_;
            chomp($line);
            my @array = split /,/, $line;

            # Match log type
            # if ($file =~ m/jips/) {
            #     my $sql = "INSERT INTO dbo.JIPS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?);";
            #     my $sth = $dbh->prepare( $sql );
            #     $sth->execute(@array);
            # }
            if ($file =~ m/ltms/) {
                my $sql = "INSERT INTO dbo.LTMS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?);";
                my $sth = $dbh->prepare( $sql );
                $sth->execute(@array);
            }
        }
        die "Error reading $file: $gzerrno" if $gzerrno != Z_STREAM_END;
        $gz->gzclose();
    }
}

Re: Performance issues with inserting csv into SQL Server
by roboticus (Chancellor) on Jul 11, 2013 at 17:21 UTC

    SRick:

    I'd suggest using perl to put the log file into a standardized format that BCP can digest easily. Bulk-insert tools are optimized for this sort of thing.

    BCP is pretty easy to use, too.

    I find it best to BCP the data into a temporary table, then do a cleanup pass or two (deleting unwanted messages, etc.), though if you process the file before giving it to BCP, you can do the cleanup in perl rather than in the temporary table. Then you can integrate the data into your working tables.
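
    For illustration, here is a minimal sketch of that approach (not roboticus's actual code): unzip the .gz to a plain CSV with Perl, then shell out to bcp to load it into a staging table. The staging table ProjectTest.dbo.LTMS_stage, the file names, and the connection details are assumptions made up for the example, not anything from the original post.

        use strict;
        use warnings;
        use Compress::Zlib;

        # Decompress the .gz log into a plain CSV that bcp can read directly.
        my $gz_file  = 'ltms_log.csv.gz';       # assumed input file name
        my $csv_file = 'ltms_log.csv';

        my $gz = gzopen($gz_file, "rb") or die "Cannot open $gz_file: $gzerrno\n";
        open my $out, '>', $csv_file or die "Cannot write $csv_file: $!";
        while ($gz->gzreadline(my $line) > 0) {
            # Any cleanup/filtering (dropping unwanted messages, etc.) could happen here.
            print {$out} $line;
        }
        $gz->gzclose();
        close $out;

        # Bulk-load the CSV into a staging table with the vendor tool.
        # ProjectTest.dbo.LTMS_stage is a hypothetical staging table matching dbo.LTMS.
        my @bcp = ('bcp', 'ProjectTest.dbo.LTMS_stage', 'in', $csv_file,
                   '-S', 'host,port', '-U', 'user', '-P', 'password',
                   '-c', '-t,');                # character mode, comma field terminator
        system(@bcp) == 0 or die "bcp failed: $?";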

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Performance issues with inserting csv into SQL Server
by Perlbotics (Archbishop) on Jul 11, 2013 at 17:23 UTC

    Three things you could try, sorted by estimated gain/change:

    • Switch off AutoCommit and do a manual commit() every few rows, or even only once after the loop. This should give you the biggest gain for the amount of change.
    • It is not necessary to prepare(...) the statement handle within each iteration; prepare it once before the loop. I guess this will not have a big impact, though. (A sketch of both points follows this list.)
    • If program execution is still too slow, try to use the batch import tool of your DB. Create an import file first, then use the batch loader.
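
    As a rough illustration of the first two points applied to the original loop (the 21-column dbo.LTMS insert is taken from the post; the file name and the 1,000-row commit interval are arbitrary choices for the example):

        use strict;
        use warnings;
        use DBI;
        use Compress::Zlib;

        my $dbh = DBI->connect(
            'DBI:ODBC:Driver={SQL Server};Server=host,port;Database=ProjectTest',
            'user', 'password',
            { RaiseError => 1, AutoCommit => 0 },     # point 1: AutoCommit off
        ) or die "Database connection not made: $DBI::errstr";

        # Point 2: prepare once, outside the loop (21 columns, as in the original INSERT).
        my $sth = $dbh->prepare(
            'INSERT INTO dbo.LTMS VALUES (' . join(',', ('?') x 21) . ')'
        );

        my $file = 'ltms_log.csv.gz';                 # assumed file name
        my $gz   = gzopen($file, "rb") or die "Cannot open $file: $gzerrno\n";

        my $count = 0;
        while ($gz->gzreadline(my $line) > 0) {
            chomp $line;
            $sth->execute(split /,/, $line);
            $dbh->commit unless ++$count % 1000;      # commit every 1,000 rows
        }
        $dbh->commit;                                 # commit whatever is left
        $gz->gzclose();
        $dbh->disconnect;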
    HTH

Re: Performance issues with inserting csv into SQL Server
by MidLifeXis (Monsignor) on Jul 11, 2013 at 17:21 UTC

    You will probably see better results by using the vendor's batch loading application. That said, you are currently committing after every insert, which is expensive; that is probably the main source of your slowness. The transactional safety comes at a price. When loading in bulk, things like per-row transactions, index updates, and other database features can cost more than they need to.

    Even taking all of that into account, I would still probably use the batch loader, unless there is something special you need to do to the data before loading (and even then I would preprocess it into something the batch loader can use).
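
    For reference, a bulk load can also be driven from inside the Perl script by issuing a T-SQL BULK INSERT against a staging file. The path is resolved on the SQL Server machine, not the client, so the CSV must be on a drive or share the server can reach. The staging table dbo.LTMS_stage and the UNC path below are made up for the example:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect(
            'DBI:ODBC:Driver={SQL Server};Server=host,port;Database=ProjectTest',
            'user', 'password', { RaiseError => 1 },
        ) or die "Database connection not made: $DBI::errstr";

        # dbo.LTMS_stage is a hypothetical staging table with the same columns as dbo.LTMS.
        my $server_path = '\\\\fileserver\\logs\\ltms_log.csv';   # assumed UNC path

        $dbh->do(qq{
            BULK INSERT dbo.LTMS_stage
            FROM '$server_path'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
        });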

    --MidLifeXis

Re: Performance issues with inserting csv into SQL Server
by nikosv (Deacon) on Jul 12, 2013 at 06:56 UTC

    What is your underlying table: is it a B-tree (i.e. does it have a clustered index)? Do you have indexes on it? Do you allow duplicate values?

    It might be worth checking whether converting the table to a heap (dropping the indexes, so the rows get inserted much faster) speeds things up, and then converting it back to the original structure and re-enabling the dropped indexes afterwards.
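
    A gentler variant of the same idea is to disable nonclustered indexes around the load instead of converting the table to a heap. A sketch, driven through DBI; IX_LTMS_time is a hypothetical index name:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect(
            'DBI:ODBC:Driver={SQL Server};Server=host,port;Database=ProjectTest',
            'user', 'password', { RaiseError => 1 },
        ) or die "Database connection not made: $DBI::errstr";

        # Disable a nonclustered index before the big load (IX_LTMS_time is hypothetical).
        # Disabling the *clustered* index would make the table unreadable until rebuilt.
        $dbh->do('ALTER INDEX IX_LTMS_time ON dbo.LTMS DISABLE');

        # ... run the inserts or bulk load here ...

        # Rebuild the index afterwards so it is usable again.
        $dbh->do('ALTER INDEX IX_LTMS_time ON dbo.LTMS REBUILD');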