krish1982 has asked for the wisdom of the Perl Monks concerning the following question:

I have tried Parallel::ForkManager with DBI, but I got the error DBD::mysql::st execute failed: Lost connection to MySQL server during query. Here is the sample code. I want to run the query between a low and a high value (I have split the work into 10k-record chunks):
use Parallel::ForkManager;

my $pm = new Parallel::ForkManager(50);
my $db = krish::DB->new or die $!;    # it has all the connection details

while ( $low < $high ) {
    # Some value manipulation

    my $pid = $pm->start and next;

    # db_execution returns a statement handle
    while ( my $sth = db_execution( $db, $low, $high ) ) {
        ...    # fetch row operation
    }
    $pm->finish;
}

sub db_execution {
    ...
    my $dbh = $db->connect( 'students' ) or die $!;
    my $sth = $dbh->prepare( $sql ) or die "$!:" . $dbh->errstr;
    $sth->execute or die "$!:" . $sth->errstr;
    ...
}
The same code runs fine without the parallel processing. What is the issue, and how do I resolve it?

Replies are listed 'Best First'.
Re: Parallel Processing Database Query
by psini (Deacon) on Jul 28, 2009 at 11:35 UTC

    I think that your problem is that you can't share a single DB connection between forked processes. Try opening the connection after the fork.
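
    A minimal sketch of that approach, assuming plain DBI against MySQL (the DSN, credentials, table, and chunk ranges below are placeholders, not from the original post). Each child opens its own handle after $pm->start and closes it before $pm->finish:

        use strict;
        use warnings;
        use DBI;
        use Parallel::ForkManager;

        my ( $user, $pass ) = ( 'dbuser', 'dbpass' );    # placeholders
        my @chunks = map { [ $_ * 10_000, ( $_ + 1 ) * 10_000 - 1 ] } 0 .. 9;

        my $pm = Parallel::ForkManager->new(10);

        for my $chunk (@chunks) {
            $pm->start and next;    # parent loops on; child continues below

            # Connect *inside* the child, after the fork
            my $dbh = DBI->connect( 'dbi:mysql:database=students',
                $user, $pass, { RaiseError => 1 } );

            my $sth = $dbh->prepare(
                'SELECT id, name FROM students WHERE id BETWEEN ? AND ?');
            $sth->execute(@$chunk);
            while ( my $row = $sth->fetchrow_hashref ) {
                # ... per-row work here ...
            }

            $dbh->disconnect;
            $pm->finish;    # child exits here
        }
        $pm->wait_all_children;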

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Parallel Processing Database Query
by roboticus (Chancellor) on Jul 28, 2009 at 12:19 UTC
    krish1982:

    Another possible problem is simply that your request is taking too long and the connection times out. If that's the case, your SQL may be working too hard; review your table to see whether it's indexed appropriately for the query.
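
    For instance, one quick check from Perl (a sketch; the table, columns, and credentials are invented) is to run MySQL's EXPLAIN through DBI and look at the key and rows columns of the plan:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:database=students',    # placeholders
            'dbuser', 'dbpass', { RaiseError => 1 } );
        my ( $low, $high ) = ( 0, 9_999 );    # example chunk

        my $plan = $dbh->selectall_arrayref(
            'EXPLAIN SELECT id, name FROM students WHERE id BETWEEN ? AND ?',
            { Slice => {} },
            $low, $high,
        );
        for my $step (@$plan) {
            printf "table=%s key=%s rows=%s\n",
                $step->{table} // '-', $step->{key} // 'NONE', $step->{rows} // '?';
        }

    If key comes back NULL and rows is close to the table size, the query is scanning rather than using an index.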

    One note on finish: Parallel::ForkManager's $pm->finish is required here (it's what ends each child process), but don't confuse it with DBI's $sth->finish, which the DBI documentation says is rarely needed. If you've added $sth->finish calls just to make errors go away, you might be masking a bug elsewhere in your program.

    Notes on perl vs. databases

    When you use perl with a database, make sure you're doing the work in the correct place. The two most common problems I see are having the database retrieve a large quantity of records and then writing perl code to:

    (a) compute statistics (average, maximum/minimum, sum, etc.), or

    (b) find the records meeting specific criteria.

    Both of these force the database to scan a large number of records and ship them to the perl process, and they force the perl process to read all those records and do the selection and processing itself.

    The primary problem with both is that you spend a lot of resources transporting a large amount of data that you're going to discard anyway. Remember what has to happen to send a record from the database to the perl process:

    • DB server must locate the records
    • DB server must format the record
    • DB server must issue I/O requests to the OS to send the data
    • perl process must receive each record
    • perl process must decode the record into internal form

    Since SQL gives you a lot of flexibility in computing statistics and selecting records, you will reduce the burden on both the DB server and your perl process if you use SQL to compute statistics and select records. So if you're unfamiliar with SQL, spend a little time learning some more about it. Alternatively, describe to your DBA what data you're trying to get and ask how to do it in SQL. You may be surprised to see how simple it can be.
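
    As a sketch of the difference (assuming an invented students table with a numeric score column), compare shipping every row to perl with computing the statistic in SQL:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:database=students',    # placeholder DSN
            'dbuser', 'dbpass', { RaiseError => 1 } );

        # Wasteful: fetch every row just to average one column in perl
        my $rows = $dbh->selectall_arrayref('SELECT score FROM students');
        my $sum  = 0;
        $sum += $_->[0] for @$rows;
        my $avg_in_perl = @$rows ? $sum / @$rows : 0;

        # Better: let the database do the work and send back a single value
        my ($avg_in_sql) = $dbh->selectrow_array('SELECT AVG(score) FROM students');

    The second form moves one number over the wire instead of the whole column.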

    That said, I'm not sure you're doing the work in the wrong location, but I thought I'd mention it, just in case. But in most cases where I see people trying to speed up a program with perl and a database, they're doing the work in the wrong location. I find it uncommon to see speed problems when the tasks are divided up correctly.

    ...roboticus

    Update: Added the italicised text to make the sentence mean something...

      But in most cases where I see people trying to speed up a program with perl and a database, they're doing the work in the wrong location. I find it uncommon to see speed problems when the tasks are divided up correctly.

      BTW, the latest versions of PostgreSQL include support for Perl inside the database with PL/Perl, so you can sometimes get the best of both worlds.
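
      For example (a sketch; the function and its logic are invented), a PL/Perl function is ordinary Perl wrapped in a CREATE FUNCTION:

          -- Runs inside PostgreSQL; the body between the $$ markers is Perl
          CREATE OR REPLACE FUNCTION grade_label(score integer) RETURNS text AS $$
              my ($score) = @_;
              return $score >= 50 ? 'pass' : 'fail';
          $$ LANGUAGE plperl;

          -- Usage: SELECT name, grade_label(score) FROM students;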

      Just a related comment: when using ETL tools (e.g. Informatica, Ab Initio, Teradata, Talend, etc.), the best practice is to do your joins and computations in those tools, because you get the benefit of massively parallel processing.

      The exception would be if you have a parallel database engine like Netezza.

Re: Parallel Processing Database Query
by dont_you (Hermit) on Jul 29, 2009 at 06:58 UTC
    I've seen that error before when using DBI and Parallel::ForkManager. The problem arises when a $dbh is inherited from the parent and several children try to use the same connection at the same time. I solved it as follows:
        my $fork_manager = new Parallel::ForkManager(10);
        $fork_manager->start and next;    # fork

        # Clone the parent's dbh for this child
        my $dbh_child = $dbh->clone();
        $dbh->{InactiveDestroy} = 1;      # child exit must not close the parent's connection
        $dbh = $dbh_child;
    So you can try:
        sub db_execution {
            ...
            my $dbh = $db->connect( 'students' ) or die $!;

            # Clone the handle and keep the original alive across child exits
            my $dbh_child = $dbh->clone();
            $dbh->{InactiveDestroy} = 1;
            $dbh = $dbh_child;

            my $sth = $dbh->prepare( $sql ) or die "$!:" . $dbh->errstr;
            $sth->execute or die "$!:" . $sth->errstr;
            ...
        }
    Hope this helps