metaperl has asked for the wisdom of the Perl Monks concerning the following question:

I have a program which loops over an input file, one line at a time. It does some comparisons of this one line to a lookup file and then writes out the best match to a file.

However, before it continues to the next line in the file, it needs to make a single simple database commit to SQLite based on in-memory values - nothing new needs to be computed.

I am running Perl 5.8.8 under Cygwin.

My goal is to take advantage of the fact that this machine has two CPUs - it is an Intel Centrino Duo. I want the database commit to happen on the second CPU while the machine goes ahead and reads the next line.

What sort of module or function would be best for this purpose?

while (<I>) {
    my $line = $_;

    my @bucket = $bucket->based_on($line);    # Data::Bucket

    my @acct_vals;
    for (@bucket) {
        # %hash = calculate based on input and bucket value
        push @acct_vals, \%hash;
    }
    next unless @acct_vals;

    @acct_vals = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, $_->{dist} ] } @acct_vals;

    # write $acct_vals[0] and $line out to file

    # ** Help needed **
    # write out to SQLite but without holding up
    # next loop iteration
    my @dump;
    for my $acct_val (@acct_vals) {
        last if $acct_val->{dist} > $match_threshold * 1.5;
        push @dump, $acct_val;
    }

    my $d = Data::Dumper->new( [ \@dump ], ['dump'] );
    $d->Purity(1)->Terse(1)->Deepcopy(1);

    # DBIx::Simple by Juerd Waalboer
    $dbs->query( 'INSERT INTO addresses VALUES (??)',
        $acct_name, $clean_acct_name, $d->Dump );
    # *** end Help Needed

    unless ( ++$counter % $report_very ) {
        warn "$counter in $now";    # Time::Lapse by Scott Godin
    }
}
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup

Replies are listed 'Best First'.
Re: Taking advantage of dual processors (threads)
by BrowserUk (Patriarch) on Nov 19, 2007 at 16:06 UTC

    This should work anywhere, including Cygwin:

    use threads;
    use Threads::Queue;

    my $Q = new Thread::Queue;

    my $thread = async {
        ## create/open DB here
        my $dbs = ...;
        while ( my $query = $Q->dequeue ) {
            $dbs->query( $query );
        }
        ## finish with DB here
    };

    while (<I>) {
        my $line = $_;
        my @bucket = $bucket->based_on($line);    # Data::Bucket
        my @acct_vals;
        for (@bucket) {
            # %hash = calculate based on input and bucket value
            push @acct_vals, \%hash;
        }
        next unless @acct_vals;
        @acct_vals = map  { $_->[0] }
                     sort { $a->[1] <=> $b->[1] }
                     map  { [ $_, $_->{dist} ] } @acct_vals;

        my @dump;
        for my $acct_val (@acct_vals) {
            last if $acct_val->{dist} > $match_threshold * 1.5;
            push @dump, $acct_val;
        }
        my $d = Data::Dumper->new( [ \@dump ], ['dump'] );
        $d->Purity(1)->Terse(1)->Deepcopy(1);

        ## Build sql and post to db thread (Could be done better)
        my $args = "'" . join( "','", $acct_name, $clean_acct_name, $d->Dump ) . "'";
        sleep 1 while $Q->pending > 10;    ## Adjust threshold to suit.
        $Q->enqueue( sprintf 'INSERT INTO addresses VALUES (%s)', $args );

        unless ( ++$counter % $report_very ) {
            warn "$counter in $now";    # Time::Lapse by Scott Godin
        }
    }

    $Q->enqueue( undef );    ## Terminate db thread
    $thread->join;           ## And wait for it to finish.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Wait a minute, Threads::Queue has been backpanned. It's not on CPAN.

      Also, it seems that all of the solutions suggest lightweight processes. But I have two CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commit?

      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup
        Threads::Queue has been backpanned.

        Sorry, typo. It should be Thread::Queue per the usage in my example:

        my $Q = new Thread::Queue;

        (and no. You don't have to use indirect object syntax :)
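        That is, the direct method-call form does the same thing:

        my $Q = Thread::Queue->new;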

        Also, it seems that all of the solutions suggest lightweight processes. But I have two CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commit?

        No reason except that then you'll need to communicate between processes. IPC is mostly not portable. Mostly inefficient. Mostly a pain in the tush.

        You need a communications channel: pipes, sockets, or shared memory.

        You need a communications protocol or interprocess semaphores.

        But mostly, you just moved the problem. Now you have to block on reads and writes from your IPC channel.

        Thread::Queue is available, tested and works well.

        You're also barking up the wrong tree in thinking of it as one process per CPU. Each process (or thread) will run on whichever CPU comes free first when it is next ready to run. Both processes or threads could end up always running on the same CPU because some other process on your system is hogging the other one.
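        For comparison, here's a minimal sketch of what the two-process version could look like, with a pipe as the IPC channel (the database path and the literal INSERT are made up for illustration). Even this toy needs an ad hoc protocol of one statement per line, and it breaks the moment a statement contains an embedded newline:

        use strict;
        use warnings;
        use DBI;

        # fork a child whose STDIN is the read end of a pipe
        my $pid = open( my $to_writer, '|-' );
        die "Can't fork: $!" unless defined $pid;

        if ( !$pid ) {    # child: the dedicated DB writer
            my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
                { RaiseError => 1 } );
            while ( my $sql = <STDIN> ) {    # blocks until the parent writes
                chomp $sql;
                $dbh->do($sql);
            }
            $dbh->disconnect;
            exit 0;
        }

        # parent: one statement per line is the "protocol"
        print {$to_writer} "INSERT INTO addresses VALUES ('a','b','c')\n";
        close $to_writer or warn "writer exited abnormally: $?";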


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by cdarke (Prior) on Nov 19, 2007 at 15:31 UTC
    Running the commit in another thread is the obvious approach, but you need to check whether SQLite is thread-safe (others here will know). There are other ways to split the work, though.

    You can split up your data (the lines in the file) and run two processes or threads, each handling half of it. I suggest this might be simpler to implement, and safer, since the two halves should not stomp on each other (assuming the database copes with two concurrent writers).
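    A rough sketch of that first approach, splitting the lines by parity across two forked workers (process_line is a hypothetical stand-in for the compare-and-commit logic; each child opens its own SQLite handle so nothing is shared):

    use strict;
    use warnings;

    my @kids;
    for my $worker ( 0, 1 ) {
        my $pid = fork;
        die "Can't fork: $!" unless defined $pid;
        if ( !$pid ) {    # child: handle half of the lines
            open my $in, '<', 'input.txt' or die "open: $!";
            while ( my $line = <$in> ) {
                next unless $. % 2 == $worker;    # even vs. odd lines
                process_line($line);              # compare, match, commit
            }
            exit 0;
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;    # wait for both halves to finish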

    Or you could split up the task, doing the comparisons in one thread and the write-out in another. That would be more fiddly to implement, since you have to pass data between threads.
      Your advice is a little above my head. I asked for modules or functions... and was subconsciously desiring small code snippets.
      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by kyle (Abbot) on Nov 19, 2007 at 15:33 UTC

    It sounds as if you want to fork for the database. Maybe something like this:

    wait;    # wait for any previous child to finish

    my $pid = fork;
    die "Can't fork: $!" if !defined $pid;

    if ( !$pid ) {    # child
        do_sqlite_stuff();
        exit;         # this is important
    }

    Note that if you open a database handle in the parent, the child is going to clobber it. If you don't need to access the database in the parent, my advice is to open and close it in the child. Otherwise, have a look at DBI, fork, and clone.
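    As a small sketch of the open-and-close-in-the-child advice (the database path and column count here are invented; nothing DBI-related is shared with the parent):

    use strict;
    use warnings;
    use DBI;

    sub do_sqlite_stuff {
        my @values = @_;

        # the child builds its own handle, uses it, and throws it away
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
            { RaiseError => 1, AutoCommit => 0 } );
        $dbh->do( 'INSERT INTO addresses VALUES (?,?,?)', undef, @values );
        $dbh->commit;
        $dbh->disconnect;
    }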

Re: Taking advantage of dual processors
by chromatic (Archbishop) on Nov 19, 2007 at 17:34 UTC

    You want one process to block on IO (at least block buffered, if not more) while the other process blocks on a database commit?

    This doesn't seem like the type of problem where concurrency will help you much.

      I'm confused as to why you think it wouldn't?

      Done serially with '.'s representing time taken:

      v---------------------------------|
      read........munge....insert........

      versus overlapping the insert and the read in parallel:

      v-----------------------|
      read........munge.....QI.
      v-----------------------|
      DQ.insert........Wait....

      Even on a single-CPU system, with the DB running on the same box, the insert can go ahead whilst the disk head locates the next chunk of data.

      It will depend upon the relative expense of the overlapped operations, but on a multi-CPU system the elapsed time should be reduced?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Does SQLite perform blocking flushes on every commit? Does it perform asynchronous writes?

        Is the file to read on the same bus as the database file? Are they on the same drive? Are they on the same platter?

        It would surprise me to see much more than a 15% improvement from concurrency, and it wouldn't surprise me at all to see performance decrease.
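        If the answer to that first question is yes, relaxing SQLite's flushing may be a cheaper fix than concurrency. A sketch (with the usual caveat that this trades durability on power loss for speed):

        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
            { RaiseError => 1 } );

        # skip the blocking fsync on every commit
        $dbh->do('PRAGMA synchronous = OFF');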

      Well, what I want is maximum throughput for two operations that are currently sequential. It seems like dispatching the database commit to a second CPU would allow the first one to continue reading, but I really don't know.
      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup

        Why don't you try it? Make these extra processes happen in a controlled fashion. The right way to do it would be a pool of permanent SQLite writers getting input from a queue (as BrowserUk pointed out, even though his example uses threads). But even without that, you can quickly prototype with the code you've shown: keep an array of at most n commits-to-do, and every m < n lines fire m processes with system("my_writer $update &"). Throttle a bit if ps (or even /proc) shows too many my_writer processes, and ramp up again if the process count drops below a threshold (this can be made adaptive...). Decide what to do if you ever reach n: slow down, wait for #procs < n_min, etc.
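        A quick-and-dirty sketch of that throttle (my_writer stands in for a hypothetical helper script that performs a single commit, and next_update() for wherever the statements come from):

        use strict;
        use warnings;

        my $max_writers = 4;    # the "n" above; tune to taste

        sub writer_count {
            # crude, as described: count my_writer processes via ps
            return scalar grep { /my_writer/ } `ps -ef`;
        }

        while ( defined( my $update = next_update() ) ) {
            sleep 1 while writer_count() >= $max_writers;    # throttle
            system("my_writer '$update' &");                 # fire and forget
        }

        cheers --stephan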