metaperl has asked for the wisdom of the Perl Monks concerning the following question:

I have a program which loops over an input file, one line at a time. It does some comparisons of this one line to a lookup file and then writes out the best match to a file.

However, before it continues to the next line in the file, it needs to make a single simple database commit to SQLite based on in-memory values - nothing new needs to be computed.

I am running Perl 5.8.8 under Cygwin.

My goal is to take advantage of the fact that this machine has two CPUs - it is an Intel Centrino Duo. I want the database commit to happen on the second CPU while the machine goes ahead and reads the next line.

What sort of module or function would be best for this purpose?

while (<I>) {
    my $line = $_;

    my @bucket = $bucket->based_on($line);    # Data::Bucket

    my @acct_vals;
    for (@bucket) {
        # %hash = calculate based on input and bucket value
        push @acct_vals, \%hash;
    }
    next unless @acct_vals;

    @acct_vals = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, $_->{dist} ] } @acct_vals;

    # write $acct_vals[0] and $line out to file

    # ** Help needed **
    # write out to SQLite but without holding up
    # next loop iteration
    my @dump;
    for my $acct_val (@acct_vals) {
        last if $acct_val->{dist} > $match_threshold * 1.5;
        push @dump, $acct_val;
    }

    my $d = Data::Dumper->new( [ \@dump ], ['dump'] );
    $d->Purity(1)->Terse(1)->Deepcopy(1);

    # DBIx::Simple by Juerd Waalboer
    $dbs->query( 'INSERT INTO addresses VALUES (??)',
        $acct_name, $clean_acct_name, $d->Dump );
    # *** end Help Needed

    unless ( ++$counter % $report_very ) {
        warn "$counter in $now";    # Time::Lapse by Scott Godin
    }
}
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup

Replies are listed 'Best First'.
Re: Taking advantage of dual processors (threads)
by BrowserUk (Patriarch) on Nov 19, 2007 at 16:06 UTC

    This should work anywhere, including Cygwin:

    use threads;
    use Threads::Queue;

    my $Q = new Thread::Queue;

    my $thread = async {
        ## create/open DB here
        my $dbs = ...;
        while ( my $query = $Q->dequeue ) {
            $dbs->query( $query );
        }
        ## finish with DB here
    };

    while (<I>) {
        my $line = $_;
        my @bucket = $bucket->based_on($line);    # Data::Bucket
        my @acct_vals;
        for (@bucket) {
            # %hash = calculate based on input and bucket value
            push @acct_vals, \%hash;
        }
        next unless @acct_vals;
        @acct_vals = map  { $_->[0] }
                     sort { $a->[1] <=> $b->[1] }
                     map  { [ $_, $_->{dist} ] } @acct_vals;

        my @dump;
        for my $acct_val (@acct_vals) {
            last if $acct_val->{dist} > $match_threshold * 1.5;
            push @dump, $acct_val;
        }
        my $d = Data::Dumper->new( [ \@dump ], ['dump'] );
        $d->Purity(1)->Terse(1)->Deepcopy(1);

        ## Build sql and post to db thread (Could be done better)
        my $args = "'" . join( "','", $acct_name, $clean_acct_name, $d->Dump ) . "'";
        sleep 1 while $Q->pending > 10;    ## Adjust threshold to suit.
        $Q->enqueue( sprintf 'INSERT INTO addresses VALUES (%s)', $args );

        unless ( ++$counter % $report_very ) {
            warn "$counter in $now";    # Time::Lapse by Scott Godin
        }
    }

    $Q->enqueue( undef );    ## Terminate db thread
    $thread->join;           ## And wait for it to finish.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Wait a minute, Threads::Queue has been backpanned. It's not on CPAN.

      Also, it seems that all of the solutions suggest lightweight processes. But I have two CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commit?

      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup
        Threads::Queue has been backpanned.

        Sorry, typo. It should be Thread::Queue per the usage in my example:

        my $Q = new Thread::Queue;

        (and no. You don't have to use indirect object syntax :)
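        That is, the direct method-call form does the same thing:

        my $Q = Thread::Queue->new;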

        Also, it seems that all of the solutions suggest lightweight processes. But I have two CPUs, so why shouldn't I go for one heavyweight process per CPU: one for the file reading and writing, and the other for the database commit?

        No reason except that then you'll need to communicate between processes. IPC is mostly not portable. Mostly inefficient. Mostly a pain in the tush.

        You need a communications channel: pipes, sockets, or shared memory.

        You need a communications protocol or interprocess semaphores.

        But mostly, you just moved the problem. Now you have to block on reads and writes from your IPC channel.

        Thread::Queue is available, tested and works well.

        You're also barking up the wrong tree in thinking of it as one process per CPU. Each process (or thread) will run on whichever CPU comes free first when it is next ready to run. Both processes or threads could end up always running on the same CPU because some other process on your system is hogging the other one.
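        For comparison, here's a minimal sketch of what the two-process version could look like, with a pipe as the IPC channel (the database path and the literal INSERT are made up for illustration). Even this toy needs an ad hoc protocol of one statement per line, and it breaks the moment a statement contains an embedded newline:

        use strict;
        use warnings;
        use DBI;

        # fork a child whose STDIN is the read end of a pipe
        my $pid = open( my $to_writer, '|-' );
        die "Can't fork: $!" unless defined $pid;

        if ( !$pid ) {    # child: the dedicated DB writer
            my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
                { RaiseError => 1 } );
            while ( my $sql = <STDIN> ) {    # blocks until the parent writes
                chomp $sql;
                $dbh->do($sql);
            }
            $dbh->disconnect;
            exit 0;
        }

        # parent: one statement per line is the "protocol"
        print {$to_writer} "INSERT INTO addresses VALUES ('a','b','c')\n";
        close $to_writer or warn "writer exited abnormally: $?";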


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by cdarke (Prior) on Nov 19, 2007 at 15:31 UTC
    Running the commit in another thread is the obvious approach, but you need to check whether SQLite is thread-safe (others here will know). There are other ways to split the work, though.

    You can split up your data (the lines in the file) and run two processes or threads, each handling half of it. I suggest this might be simpler to implement, and safer, since the two halves should not stomp on each other (assuming the database copes with two concurrent writers).
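    A rough sketch of that first approach, splitting the lines by parity across two forked workers (process_line is a hypothetical stand-in for the compare-and-commit logic; each child opens its own SQLite handle so nothing is shared):

    use strict;
    use warnings;

    my @kids;
    for my $worker ( 0, 1 ) {
        my $pid = fork;
        die "Can't fork: $!" unless defined $pid;
        if ( !$pid ) {    # child: handle half of the lines
            open my $in, '<', 'input.txt' or die "open: $!";
            while ( my $line = <$in> ) {
                next unless $. % 2 == $worker;    # even vs. odd lines
                process_line($line);              # compare, match, commit
            }
            exit 0;
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;    # wait for both halves to finish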

    Or you could split up the task, doing the comparisons in one thread and the write-out in another. That would be more fiddly to implement, since you have to pass data between threads.
      Your advice is a little above my head. I asked for modules or functions... and was subconsciously desiring small code snippets.
      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup
Re: Taking advantage of dual processors [tag://parallel,cygwin,multi-core,fork,i/o]
by kyle (Abbot) on Nov 19, 2007 at 15:33 UTC

    It sounds as if you want to fork for the database. Maybe something like this:

    wait;    # wait for any previous child to finish

    my $pid = fork;
    die "Can't fork: $!" if !defined $pid;

    if ( !$pid ) {    # child
        do_sqlite_stuff();
        exit;         # this is important
    }

    Note that if you open a database handle in the parent, the child is going to clobber it. If you don't need to access the database in the parent, my advice is to open and close it in the child. Otherwise, have a look at DBI, fork, and clone.
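    As a small sketch of the open-and-close-in-the-child advice (the database path and column count here are invented; nothing DBI-related is shared with the parent):

    use strict;
    use warnings;
    use DBI;

    sub do_sqlite_stuff {
        my @values = @_;

        # the child builds its own handle, uses it, and throws it away
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
            { RaiseError => 1, AutoCommit => 0 } );
        $dbh->do( 'INSERT INTO addresses VALUES (?,?,?)', undef, @values );
        $dbh->commit;
        $dbh->disconnect;
    }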

Re: Taking advantage of dual processors
by chromatic (Archbishop) on Nov 19, 2007 at 17:34 UTC

    You want one process to block on IO (at least block buffered, if not more) while the other process blocks on a database commit?

    This doesn't seem like the type of problem where concurrency will help you much.

      I'm confused as to why you think it wouldn't?

      Done serially with '.'s representing time taken:

      v---------------------------------|
      read........munge....insert........

      versus overlapping the insert and the read in parallel:

      v-----------------------|
      read........munge.....QI.
      v-----------------------|
      DQ.insert........Wait....

      Even on a single-CPU system, with the DB running on the same box, the insert can go ahead whilst the disk head locates the next chunk of data.

      It will depend upon the relative expense of the overlapped operations, but on a multi-CPU system the elapsed time should be reduced?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Does SQLite perform blocking flushes on every commit? Does it perform asynchronous writes?

        Is the file to read on the same bus as the database file? Are they on the same drive? Are they on the same platter?

        It would surprise me to see much more than a 15% improvement from concurrency, and it wouldn't surprise me at all to see performance decrease.
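        If the answer to that first question is yes, relaxing SQLite's flushing may be a cheaper fix than concurrency. A sketch (with the usual caveat that this trades durability on power loss for speed):

        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=addresses.db', '', '',
            { RaiseError => 1 } );

        # skip the blocking fsync on every commit
        $dbh->do('PRAGMA synchronous = OFF');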

      Well, what I want is maximum throughput for two operations that are currently sequential. It seems like dispatching the database commit to a second CPU would allow the first one to continue reading, but I really don't know.
      I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes. How can you say that there is no God in the Church of Emacs? -- David Kastrup

        Why don't you try it? Make these extra processes happen in a controlled fashion. The right way to do it would be a pool of permanent SQLite writers getting input from a queue (as BrowserUk pointed out, even though his example uses threads). But even without that, you can quickly prototype with the code you've shown: keep an array of at most n commits-to-do, and every m < n lines fire m processes with system("my_writer $update &"). Throttle a bit if ps (or even /proc) shows too many my_writer processes, and ramp up again if the process count drops below a threshold (this can be made adaptive...). Decide what to do if you ever reach n: slow down, wait for #procs < n_min, etc.
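        A quick-and-dirty sketch of that throttle (my_writer stands in for a hypothetical helper script that performs a single commit, and next_update() for wherever the statements come from):

        use strict;
        use warnings;

        my $max_writers = 4;    # the "n" above; tune to taste

        sub writer_count {
            # crude, as described: count my_writer processes via ps
            return scalar grep { /my_writer/ } `ps -ef`;
        }

        while ( defined( my $update = next_update() ) ) {
            sleep 1 while writer_count() >= $max_writers;    # throttle
            system("my_writer '$update' &");                 # fire and forget
        }

        cheers --stephan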