baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

hi,

in my last post i had a similar problem, and that part was solved. now i'm trying to apply the previous solution in parallel.

to elaborate: i have 4 databases, DB1, DB2, DB1' and DB2'. what i'm aiming to do is connect to database DB1' through one child process, attach DB1 and DB2 to DB1', find data in tables DB1.T1 and DB2.T1 that have a match in table DB1.T2, and insert it into a newly created table (GITAXNR) in database DB1'. the same process i wish to repeat with child process number 2 (connect to DB2', search tables DB1.T1 and DB2.T1 for a match in DB2.T2, and insert the matches into a new table (GITAXNR) in DB2').

here are the schematics and the code i've written:

schematics:

    attach DB1 and DB2 --> DB1'   <= child 1 connection; match against DB1.T2
    attach DB1 and DB2 --> DB2'   <= child 2 connection; match against DB2.T2

    (each source database carries the tables T1 = NRP and T2 = GITAX)

MAIN CODE:

use strict;
use warnings;
use Subs;
use DBI;

my @children;                     ## was "my $children" -- pids get pushed onto an array
my $beginn = "begin transaction"; ## these were used but never declared; assumed values
my $commit = "commit";

for my $i (1 .. 2) {
    my $pid = fork();
    if ($pid) {
        push @children, $pid;
    }
    elsif (defined $pid) {        ## child; a bare "$pid == 0" would also run if fork failed
        ## each child opens its own connection, pointing to DB 1' or DB 2'
        my $dbh = DBI->connect("dbi:SQLite:dbname=DB$i", "", "",
                               { RaiseError => 1, AutoCommit => 1 });
        my $self = Subs->new(dbh => $dbh);   ## the object must carry the handle
        my $joinTable = "GITAXNR";
        my $form1     = "gi INTEGER not null PRIMARY KEY, "
                      . "ID INTEGER not null, quest BLOB not null";
        $self->do_it_nr(argument => $beginn);
        $self->createtable_nr(table => $joinTable, form => $form1);
        $self->do_it_nr(argument => $commit);

        for my $j (1 .. 2) {
            ## attach DB 1 and DB 2 to this child's connection;
            ## note the literal file name 'DBi1'/'DBi2' -- if the source files
            ## are really named DB1/DB2, this mismatch is the first thing to
            ## check when SQLite reports "no such table db1.GITAX"
            my $database = "attach database 'DBi$j' as db$j";
            $self->do_it_nr(argument => $database);
            $self->do_it_nr(argument => $beginn);
            my $insert = "insert into GITAXNR(gi,ID,quest) "
                       . "select GITAX.gi, GITAX.ID, NRP.quest "
                       . "from db$i.GITAX inner join db$j.NRP on GITAX.gi = NRP.gi";
            $self->do_it_nr(argument => $insert);
            $self->do_it_nr(argument => $commit);
        }
        exit;
    }
    else {
        die "could not fork: $!";
    }
}

waitpid($_, 0) for @children;

SUBS (in a separate package):

package Subs;
use strict;
use warnings;

sub new {
    my ($class, %arg) = @_;
    my $self = { dbh => $arg{dbh} };   ## keep the database handle with the object
    return bless $self, $class;
}

sub do_it_nr {
    my ($self, %arg) = @_;
    my $stm = $self->{argument} = $arg{argument};
    my $st  = $self->{dbh}->prepare($stm);   ## was a bare $dbh, which is not in scope here
    $st->execute();
}

sub drop_nr {
    my ($self, %arg) = @_;
    my $arg = $self->{argument} = $arg{argument};
    $self->do_it_nr(argument => "drop table if exists $arg");
}

sub createtable_nr {
    my ($self, %arg) = @_;
    my $form   = $self->{form}  = $arg{form};
    my $tabler = $self->{table} = $arg{table};
    $self->drop_nr(argument => $tabler);
    $self->do_it_nr(argument => "create table $tabler ($form)");
}

1;
the error message i'm getting says it cannot find table db1.GITAX (that is the T2 table in the schematics; NRP is T1).
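One way to narrow an error like that down (a debugging sketch, not part of the original post -- the file names DB1/DB2 are assumptions, substitute your own): after each ATTACH, ask SQLite which schemas it actually sees with PRAGMA database_list, and which tables live inside each schema via that schema's sqlite_master.

```perl
use strict;
use warnings;
use DBI;

# connect and attach exactly as the child process would, then print what
# SQLite can actually see -- a quick sanity check when "no such table"
# errors appear after an ATTACH
my $dbh = DBI->connect("dbi:SQLite:dbname=DB1", "", "",
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do("attach database 'DB2' as db2");

# which schemas are attached, and under which alias and file name?
for my $row (@{ $dbh->selectall_arrayref("PRAGMA database_list") }) {
    my ($seq, $name, $file) = @$row;
    print "schema '$name' -> file '$file'\n";
}

# which tables exist inside the attached schema?
my $tables = $dbh->selectcol_arrayref(
    "select name from db2.sqlite_master where type = 'table'");
print "db2 tables: @$tables\n";
```

If the alias is there but the table list is empty, the ATTACH opened (or silently created) the wrong file.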

i haven't seen anyone else having problems with this -- maybe it's not possible, or maybe it's too trivial to be mentioned. :)

thanks !!

note :

the procedure works when i run it sequentially (without fork)

Replies are listed 'Best First'.
Re: querying multiple databases in parallel -- attach problem #2
by Narveson (Chaplain) on Nov 14, 2008 at 10:09 UTC

    Why fork?

    Just create multiple database handles and connect each one to a different database.
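    Narveson's suggestion, sketched (database names follow the original post; the create-table statement stands in for whatever per-database work you run):

```perl
use strict;
use warnings;
use DBI;

# one independent handle per target database, no fork needed
my @dbh = map {
    DBI->connect("dbi:SQLite:dbname=DB$_", "", "",
                 { RaiseError => 1, AutoCommit => 1 })
} 1 .. 2;

# run the same per-database work on each handle in turn
for my $dbh (@dbh) {
    $dbh->do("create table if not exists GITAXNR ("
           . "gi INTEGER not null PRIMARY KEY, "
           . "ID INTEGER not null, quest BLOB not null)");
}
```

    Note that this still executes the statements one after another in a single process, on a single core.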

      with forking (or threading) i'm trying to speed up the process. the databases are quite large, and when i query them as one db my one core is at 100% while the rest just sit there. so speeding up the process even twofold could mean 4h saved, not to mention running it on 8 cores, which i'm planning to do.

      so my question is focused on how to enable this.

      i'm also open for any alternative suggestions that would speed up my process

      thnx

        By breaking each DB handle out to a different core, you could still have idle time (one DB handle consistently takes longer to complete the query, for example).

        Can your problem be broken down more at the higher level? In other words, instead of breaking each handle out to a different thread, break the top level data set into N smaller sets and distribute those smaller jobs out to the individual cores. Each core would still have M (4 in this case?) DB handles, but each core could stay busy (less the slight overhead for managing the data sets).

        Now, if all of the top-level data is interdependent, then never mind :-)
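        MidLifeXis's partitioning idea in rough Perl, under assumptions not in the thread: the gi key space can be cut into independent ranges, the source files are named DB1/DB2 with tables GITAX and NRP, and DBD::SQLite's sqlite_busy_timeout is available. Concurrent writers to one SQLite file still serialize on its lock, so the win comes from the CPU-bound join work, not the writes.

```perl
use strict;
use warnings;
use DBI;

# hypothetical: the gi key space split into independent ranges, one per core
my @ranges = ([1, 1000], [1001, 2000], [2001, 3000], [3001, 4000]);
my @children;

for my $range (@ranges) {
    my $pid = fork();
    die "could not fork: $!" unless defined $pid;
    if ($pid) { push @children, $pid; next }

    # connect *inside* the child -- a DBI handle must not be shared across fork
    my $dbh = DBI->connect("dbi:SQLite:dbname=DB1", "", "",
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->sqlite_busy_timeout(30_000);   # wait out the other writers' locks
    $dbh->do("attach database 'DB2' as db2");
    $dbh->do("insert into GITAXNR(gi,ID,quest) "
           . "select GITAX.gi, GITAX.ID, NRP.quest "
           . "from GITAX inner join db2.NRP on GITAX.gi = NRP.gi "
           . "where GITAX.gi between ? and ?", undef, @$range);
    exit 0;
}

waitpid($_, 0) for @children;
```

        Parallel::ForkManager from CPAN wraps this same fork/wait bookkeeping more tidily if the worker count needs throttling.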

        --MidLifeXis