Re: How to optimise "csv into mysql" using Text::CSV_XS and DBD::mysql
by Corion (Patriarch) on Aug 06, 2015 at 08:51 UTC
Thank you for the hint. However, I tried LOAD DATA prior to this solution, but as the input data is very dirty, I wanted more control over what is being imported. Maybe I can try to remove the garbage first and then use LOAD DATA.
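For reference, a minimal sketch of that approach: clean the file first, then bulk-load it. The file and table names here are hypothetical, and the LOCAL variant needs the mysql_local_infile option enabled on both client and server.

use strict;
use warnings;
use DBI;

# Hypothetical sketch: bulk-load a pre-cleaned CSV file.
# 'cleaned.csv' and 'some_table' are made-up names for illustration.
my $dbh = DBI->connect(
    'dbi:mysql:database=test;mysql_local_infile=1',
    'user', 'pass', { RaiseError => 1 },
);

$dbh->do(q{
    LOAD DATA LOCAL INFILE 'cleaned.csv'
    INTO TABLE some_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
});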
Re: How to optimise "csv into mysql" using Text::CSV_XS and DBD::mysql
by GrandFather (Saint) on Aug 06, 2015 at 08:53 UTC
The first step is to run Devel::Profile over your code and see where the time is spent. If it's parsing the CSV, there may not be much you can do. If it's adding rows, you may find that inserting multiple rows at a time helps. But the important thing is to know where to focus your attention, and for that you need to profile the code and see where the bottlenecks are.
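A minimal sketch of the multi-row idea, assuming a three-column table named some_table and a hypothetical get_next_csv_row() standing in for the Text::CSV_XS read loop:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=test', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 0 });

my $batch_size = 500;
my @buffer;

# Build and run one INSERT covering all buffered rows, then empty the buffer.
sub flush_batch {
    my ($rows) = @_;
    return unless @$rows;
    my $row_ph = '(' . join(',', ('?') x 3) . ')';   # one (?,?,?) group per row
    my $sql    = 'INSERT INTO some_table (a, b, c) VALUES '
               . join(',', ($row_ph) x @$rows);
    $dbh->do($sql, undef, map { @$_ } @$rows);
    @$rows = ();
}

while (my $row = get_next_csv_row()) {   # stand-in for the CSV parsing loop
    push @buffer, $row;
    flush_batch(\@buffer) if @buffer >= $batch_size;
}
flush_batch(\@buffer);   # flush the final partial batch
$dbh->commit;

Batching trades a little memory for far fewer round trips; the batch size is a tuning knob, not a magic number.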
Premature optimization is the root of all job security
Ouch, overkill
The first step is to understand just what the code is doing. If taiko doesn't understand what the code is doing, Devel::Profile won't help.
In this case taiko is connecting to an SQL database without explicitly disabling autocommit. This is where the slowdown is occurring: every time they call execute on the statement handle, it executes the statement and then commits the change. I highly recommend they follow the advice given by Corion in this case, as using a bulk load tool will be even faster than disabling autocommit.
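For illustration, a sketch of what disabling autocommit looks like, with a periodic commit. The file name and table are placeholders for the poster's actual code.

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });
open my $fh, '<', 'input.csv' or die $!;   # 'input.csv' is a placeholder

my $dbh = DBI->connect('dbi:mysql:database=test', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 0 });  # no commit after every execute

my $sth = $dbh->prepare('INSERT INTO some_table (a, b) VALUES (?, ?)');

my $n = 0;
while (my $row = $csv->getline($fh)) {
    $sth->execute(@$row);
    $dbh->commit if ++$n % 1000 == 0;   # commit once per 1000 rows
}
$dbh->commit;                           # commit any remaining rows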
Hi SimonPratt,
Hm, I sort of disagree. It is very rarely (well, almost never) a bad idea to use a profiling tool when your code is running too slowly. It is well known that developers are notoriously bad at guessing where code is slow without a profiling tool.
Basically, if you double the speed of a procedure or function that takes only 10% of the overall execution time, your win will be about 5%, i.e. nothing substantial. And that happens very frequently. So it is really better to concentrate on the parts of the code that are genuinely slow. Perhaps it is not what you think. Perhaps it is a single line of code that is really poorly designed, and you won't know until you measure it.
So, even though I also have the general feeling, in the case in point, that the performance problem is probably with the database (autocommit, commit rate, transaction control, redo logs, and so on), you can't be sure until you have profiled your code. To me, the 30% improvement obtained by the original poster is quite small. In my experience, real performance breakthroughs are much larger (one to two orders of magnitude).
In short, if you don't have a performance problem, it is probably useless to optimize. But if you do have a performance problem, using a profiling tool (even a very limited one) is usually better than shooting in the dark.
I am working with a proprietary language for accessing a very large database. That language has no profiling tools. I was asked to improve the performance of our extraction programs. After having become pi**ed off at manually modifying programs to find out where they spent a lot of time, I wrote a little Perl program that modifies our program sources in that language to produce a detailed account of where the program is spending its time. I am really happy that I did that and wish I had done it earlier. Now I know immediately if and where I can improve those programs.
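The same idea can be sketched in Perl itself with Time::HiRes; run_suspect_section() is a made-up stand-in for the code under test:

use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
run_suspect_section();   # hypothetical stand-in for the suspect code
printf "section took %.3fs\n", tv_interval($t0);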
In brief, if you have a performance problem, I do not think that profiling the code can ever be a mistake.
Thanx for the hint; indeed, switching AutoCommit off increased the speed by roughly 30%.
Tried Devel::Profile, but it's somehow not running and gives the error
String found where operator expected at (eval 8)[C:/Strawberry/perl/site/lib/Devel/Profile.pm:133] line 2, near "croak 'usage: $io->getline()'"
(Do you need to predeclare croak?)
syntax error at (eval 8)[C:/Strawberry/perl/site/lib/Devel/Profile.pm:133] line 2, near "croak 'usage: $io->getline()'"
croak is a function exported by the Carp module. Try to add this to your code:
use Carp;
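For illustration only (fetch_line() is a made-up example): once Carp is loaded, croak reports the error from the caller's perspective, which is why module code tends to prefer it over die.

use Carp;

sub fetch_line {
    my ($io) = @_;
    # croak dies with the *caller's* file and line number, unlike die
    croak 'usage: $io->getline()' unless ref $io;
    return $io->getline;
}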
Re: How to optimise "csv into mysql" using Text::CSV_XS and DBD::mysql
by chacham (Prior) on Aug 06, 2015 at 13:13 UTC
my $query = qq{INSERT INTO some_table (id,$fieldList) VALUES (id,$field_placeholders)};
Side comment: dynamic SQL is not secure. Admittedly, this one is more interesting in that it actually uses placeholders, but building the placeholder string kind of defeats the purpose. That is, there is still the benefit that strings need not be escaped and there will be strong typing, but the statement is no more secure.
If you're just doing an import, MySQL supports it directly with LOAD DATA INFILE.
$sth = $dbh->prepare("something something ?, ?");
$sth->execute(@args);
–and–
$place = "?, ?";
$sth = $dbh->prepare("something something $place");
$sth->execute(@args);
Re: How to optimise "csv into mysql" using Text::CSV_XS and DBD::mysql
by sundialsvc4 (Abbot) on Aug 06, 2015 at 14:54 UTC
Why are you calling $sth->finish()?!
From DBI:
Indicate that no more data will be fetched from this statement handle before it is either executed again or destroyed. You almost certainly do not need to call this method.
Adding calls to finish after a loop that fetches all rows is a common mistake, don't do it, it can mask genuine problems like uncaught fetch errors.
When all the data has been fetched from a SELECT statement, the driver will automatically call finish for you. So you should not call it explicitly except when you know that you've not fetched all the data from a statement handle and the handle won't be destroyed soon.
... etcetera ... go read the page at the link provided ...
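For contrast, a small sketch of the one case the quoted docs do describe, abandoning a SELECT before fetching every row (table and column names hypothetical):

my $sth = $dbh->prepare('SELECT id, name FROM some_table');
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    last if $row[1] eq 'target';   # stop before the result set is exhausted
}
$sth->finish;   # tell the driver we are deliberately abandoning the rest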
I notice that the examples in DBD::mysql, from which this code was probably more-or-less "lifted," themselves include calls to finish(), which I think should be removed from that documentation.
Your design is appropriate in that it prepare()s the handle, then repeatedly uses it to execute() inserts that use placeholders. But I will wager that finish() is then doing all the wrong things to it, and I would gamble that this is where your speed is actually going away.
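A sketch of that prepare-once, execute-many shape with finish() kept out of the loop ($csv and $fh stand in for the poster's Text::CSV_XS object and file handle):

my $sth = $dbh->prepare('INSERT INTO some_table (a, b) VALUES (?, ?)');
while (my $row = $csv->getline($fh)) {
    $sth->execute(@$row);   # no $sth->finish() here
}
# For INSERT statements there is nothing left to fetch, so nothing to
# finish; the handle can simply be reused or go out of scope.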
I have read the manual and got your point.
I will test whether it improves the performance of the scripts.
Thanx for your hint.
Edit: I removed the finish() from the while loop, but it didn't help much with performance.