Re: Taking advantage of dual processors (threads)
by BrowserUk (Patriarch) on Nov 19, 2007 at 16:06 UTC
use threads;
use Threads::Queue;
use Data::Dumper;

my $Q = new Thread::Queue;

my $thread = async {
    ## create/open DB here
    my $dbs = ...;
    while( my $query = $Q->dequeue ) {
        $dbs->query( $query );
    }
    ## finish with DB here
};
while (<I>) {                      ## I is the input filehandle, opened elsewhere
    my $line = $_;
    my @bucket = $bucket->based_on( $line );    # Data::Bucket
    my @acct_vals;
    for (@bucket) {
        # %hash = calculate based on input and bucket value
        push @acct_vals, \%hash;
    }
    next unless @acct_vals;

    ## Schwartzian transform: order candidates by ascending distance
    @acct_vals = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, $_->{dist} ] } @acct_vals;

    my @dump;
    for my $acct_val (@acct_vals) {
        last if $acct_val->{dist} > $match_threshold * 1.5;
        push @dump, $acct_val;
    }

    my $d = Data::Dumper->new( [ \@dump ], [ 'dump' ] );
    $d->Purity( 1 )->Terse( 1 )->Deepcopy( 1 );

    ## Build sql and post to db thread (could be done better)
    my $args = "'" . join( "','", $acct_name, $clean_acct_name, $d->Dump ) . "'";
    sleep 1 while $Q->pending > 10;    ## Adjust threshold to suit.
    $Q->enqueue( sprintf 'INSERT INTO addresses VALUES (%s)', $args );

    unless ( ++$counter % $report_every ) {
        warn "$counter in $now";    # Time::Lapse by Scott Godin
    }
}
$Q->enqueue( undef );    ## Terminate db thread
$thread->join;           ## And wait for it to finish.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Wait a minute, Threads::Queue has been backpanned.
It's not on CPAN.
Also, it seems that all the solutions suggest lightweight processes. But I have 2 CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commits?
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes.
How can you say that there is no God in the Church of Emacs? -- David Kastrup
The module is Thread::Queue (no 's'); the extra 's' in my use line was a typo. Thread::Queue ships with Perl, so it hasn't gone anywhere:
my $Q = new Thread::Queue;
(and no, you don't have to use indirect object syntax: my $Q = Thread::Queue->new; works just as well :)
Also, it seems that all the solutions suggest lightweight processes. But I have 2 CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commits?
No reason except that then you'll need to communicate between processes. IPC is mostly not portable. Mostly inefficient. Mostly a pain in the tush.
You need a communications channel: pipes, sockets, or shared memory.
You need a communications protocol or interprocess semaphores.
But mostly, you just moved the problem. Now you have to block on reads and writes from your IPC channel.
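To make that concrete, here is a minimal sketch of what the two-process version might look like over a plain pipe. Everything in it is illustrative: the SQL is a stand-in, the actual database work is left as comments, and there is no error recovery.

use strict;
use warnings;
use IO::Handle;

pipe( my $reader, my $writer ) or die "pipe: $!";

my $pid = fork;
die "fork: $!" unless defined $pid;

if ( !$pid ) {                        ## child: the DB writer
    close $writer;
    ## open the database handle here, inside the child
    while ( my $sql = <$reader> ) {   ## blocks until the parent writes
        chomp $sql;
        ## $dbh->do( $sql );          ## hypothetical commit
    }
    ## close the database handle here
    exit 0;
}

close $reader;                        ## parent: reader/munger
$writer->autoflush( 1 );
while ( my $line = <> ) {
    chomp $line;
    my $sql = "INSERT INTO addresses VALUES ('$line')";    ## stand-in munge
    print {$writer} "$sql\n";         ## blocks if the pipe buffer fills
}
close $writer;                        ## child sees EOF and finishes
waitpid $pid, 0;

Note how the blocking has just moved into the print and the read from the pipe: exactly the problem described above.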
Thread::Queue is available, tested and works well.
You're also barking up the wrong tree in thinking of it as one process per CPU. Each process (or thread) will get run on whichever CPU comes free first when it is next ready to run. Both processes or threads could end up always running on the same CPU because some other process on your system is hogging one of them.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by cdarke (Prior) on Nov 19, 2007 at 15:31 UTC
Running the commit in another thread is the obvious approach, but you need to check whether SQLite is thread-safe (others will know). There are different approaches, though.
You can split up your data (the lines in the file) and run two processes or threads, each handling half of the data. I suggest that this might be simpler to implement, and safer, since they should not be stomping on each other (assuming the database copes with concurrent writers).
Or you could split up the task, doing the comparisons in one thread and the write-out in another. That would be more fiddly to implement, since you have to pass data between threads.
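A minimal sketch of the first approach, splitting the data between two forked workers that each take alternate lines. The filename and the database handling are placeholders, and it assumes the database copes with two concurrent writers:

use strict;
use warnings;

my $workers = 2;
my @pids;

for my $w ( 0 .. $workers - 1 ) {
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ( !$pid ) {                                    ## child
        ## open a private database handle here
        open my $in, '<', 'input.txt' or die "open: $!";
        while ( my $line = <$in> ) {
            next unless ( $. - 1 ) % $workers == $w;  ## take every 2nd line
            ## munge $line and insert it here
        }
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

The second approach, splitting the task, is essentially what the Thread::Queue example above does.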
Your advice is a little above my head. I asked for modules or functions... and was subconsciously desiring small code snippets.
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes.
How can you say that there is no God in the Church of Emacs? -- David Kastrup
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by kyle (Abbot) on Nov 19, 2007 at 15:33 UTC
wait;    # wait for any previous child to finish
my $pid = fork;
die "Can't fork: $!" if !defined $pid;
if ( !$pid ) {
    # child
    do_sqlite_stuff();
    exit;    # this is important
}
Note that if you open a database handle in the parent, the child is going to clobber it. If you don't need to access the database in the parent, my advice is to open and close it in the child. Otherwise, have a look at DBI, fork, and clone.
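Following that advice, a minimal sketch with the handle opened and closed entirely inside the child. DBD::SQLite is assumed, and the database name and inserts are placeholders:

use strict;
use warnings;
use DBI;

wait;                                 # reap any previous child first
my $pid = fork;
die "Can't fork: $!" if !defined $pid;

if ( !$pid ) {                        # child
    ## the parent never touches this handle
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=accounts.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    ## $dbh->do( ... ) the batched inserts here
    $dbh->commit;
    $dbh->disconnect;
    exit;                             # this is important
}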
Re: Taking advantage of dual processors
by chromatic (Archbishop) on Nov 19, 2007 at 17:34 UTC
You want one process to block on IO (at least block buffered, if not more) while the other process blocks on a database commit?
This doesn't seem like the type of problem where concurrency will help you much.
Running sequentially, the read, munge, and insert times simply add up:
v---------------------------------|
read........munge....insert........
versus overlapping the insert and the read in parallel:
v-----------------------|
read........munge.....QI.
v-----------------------|
DQ.insert........Wait....
(QI is the main thread queueing the insert; DQ is the DB thread dequeueing it.)
Even on a single-CPU system, and with the DB running on the same box, the insert can go ahead whilst the disk head locates the next chunk of data.
It will depend upon the relative expense of the overlapped operations, but on a multi-CPU system the elapsed time should be reduced?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Does SQLite perform blocking flushes on every commit? Does it perform asynchronous writes?
Is the file to read on the same bus as the database file? Are they on the same drive? Are they on the same platter?
It would surprise me to see much more than a 15% improvement from concurrency, and it wouldn't surprise me at all to see performance decrease.
Well, what I want is maximum throughput for two operations which are currently sequential. It seems like dispatching the database commit to a second CPU would allow the first one to continue reading, but I really don't know.
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes.
How can you say that there is no God in the Church of Emacs? -- David Kastrup
Why don't you try it? Make these extra processes happen in a controlled fashion. Even without a pool of permanent SQLite writers getting their input from a queue, which would be the right way to do it (as BrowserUk pointed out, even though that example uses threads), you can still quickly prototype with the code you've shown, as sketched below: keep track of at most n outstanding commits, and every m < n lines fire off m writer processes with system("my_writer $update &"). Throttle back a bit if ps (or even /proc) shows too many my_writer processes, and increase again if your process count drops below a threshold (this can be made adaptive). Decide what to do if you reach n (slow down, wait for #proc < n_min, etc.).
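Something like this quick-and-dirty sketch, where my_writer, the limits, and the ps matching are all placeholders to be tuned:

use strict;
use warnings;

sub writers_running {
    ## approximate: count my_writer entries in the process table
    return scalar grep { /my_writer/ && !/grep/ } `ps -ef`;
}

my $n_max = 8;    ## hard ceiling on concurrent writers
my $n_min = 2;    ## resume firing below this

while ( my $update = <> ) {
    chomp $update;
    system( qq{my_writer "$update" &} );    ## fire-and-forget background writer
    if ( writers_running() >= $n_max ) {    ## hit the ceiling: back off
        sleep 1 while writers_running() > $n_min;
    }
}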
cheers
--stephan