
I can also thread the DBD section but I shouldn't share the same connection handle. Each thread should create / delete its own DBD connection.

I saw another poster suggest that it was 'safe to thread DBI provided each thread has its own handle'. I was reluctant to intercede, as that poster may know better than I, but certainly at one point in time this was not the case.

Some or all DBI drivers (and/or the shared libraries that underlie them) have been shown to not be thread-safe on some or all platforms at some point in the past. The problem is, for at least some drivers/libraries, the process ID is used to key data to the connection, so using multiple handles from a single process causes things to get mixed up.

You will have to go to the DBI mailing list for the current skinny on which drivers, if any, are thread-safe and on which platforms. You will also probably be advised that you "should not use DBI with or from threads". This seems to be the de facto standard position.

I will say that you can use DBI from a threaded program provided that you only use DBI from one thread of that program. Basically, if the program uses DBI from one (say the main) thread, then the presence of other threads that do not use DBI should not compromise anything as the DBI code will not even be aware that they are there.
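A minimal sketch of that arrangement, assuming a perl built with ithreads: worker threads do the fetching and parsing and hand their results to the main thread through a Thread::Queue, and only the main thread would ever touch DBI. The names fetch_and_parse and store_row are hypothetical stand-ins (no real download or DBI call is made here), so treat this as an illustration of the shape rather than a tested recipe for any particular driver.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-ins: the real versions would download and parse a
# file, and run a DBI insert, respectively.
sub fetch_and_parse { my ($url) = @_; return "parsed:$url" }
my @stored;
sub store_row { my ($row) = @_; push @stored, $row }    # main thread only

my $n_workers = 4;
my $work      = Thread::Queue->new( map {"http://example.com/file$_"} 1 .. 10 );
my $results   = Thread::Queue->new;
$work->enqueue( (undef) x $n_workers );    # one end-of-work marker per worker

my @workers = map {
    threads->create( sub {
        while ( defined( my $url = $work->dequeue ) ) {
            $results->enqueue( fetch_and_parse($url) );
        }
        $results->enqueue(undef);          # tell the main thread we are done
    } );
} 1 .. $n_workers;

# Only this (main) thread would ever load or call DBI.
my $finished = 0;
while ( $finished < $n_workers ) {
    my $row = $results->dequeue;
    if ( defined $row ) { store_row($row) }
    else                { $finished++ }
}
$_->join for @workers;
print scalar(@stored), " rows stored\n";
```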

In any case, as you are processing large numbers of mostly relatively small files, and running under *nix, there seems to be little or no advantage in using threads over processes for this. In this case I suggest using processes rather than threads.

Parallel::ForkManager seems almost ideal for the application. Have the main process load an array of the urls to be fetched and then fork a child for each url to do a simple, linear read->parse->upload-to-DB-and-die. The parent process' role is simply to monitor the child processes and start another when one completes.
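Roughly, the loop might look like the sketch below. It assumes Parallel::ForkManager is installed from CPAN; process_url is a hypothetical placeholder for the real download, parse and upload code, and the urls are made up. Each child would open (and close) its own database connection inside process_url, so no handle is shared across processes.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical placeholder: fetch the url, parse it, connect to the DB
# (inside the child), upload, disconnect.
sub process_url {
    my ($url) = @_;
    return length $url;
}

my @urls = map {"http://example.com/file$_"} 1 .. 20;

my $pm = Parallel::ForkManager->new(8);    # at most 8 children at once

my $done = 0;
$pm->run_on_finish( sub {
    my ( $pid, $exit_code, $ident ) = @_;
    $done++ if $exit_code == 0;            # runs in the parent
} );

for my $url (@urls) {
    $pm->start($url) and next;    # parent: go on to the next url
    process_url($url);            # child: do the work ...
    $pm->finish(0);               # ... and die
}
$pm->wait_all_children;
print "$done of ", scalar(@urls), " urls processed\n";
```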

Tuning the process is simply a case of varying the set_max_procs() value at startup. With an 8-way box (assuming nothing else heavy is running concurrently on that box), you should run at least one process per CPU for maximum throughput. Because each process will be IO-bound for a large percentage of its time while doing the download, you should be able to achieve greater throughput by running double or even treble the number of processes as you have CPUs.
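As a small illustration of that tuning knob, the figure passed to set_max_procs() could be computed from the CPU count and an oversubscription factor. The helper max_procs_for and the PROCS_PER_CPU environment variable are made-up names for this sketch; the right factor is something you would find by experiment.

```perl
use strict;
use warnings;

# Worker count = CPUs x oversubscription factor (the factor defaults to 2).
sub max_procs_for {
    my ( $cpus, $factor ) = @_;
    $factor = 2 unless defined $factor;
    return $cpus * $factor;
}

my $cpus   = 8;                              # the 8-way box under discussion
my $factor = $ENV{PROCS_PER_CPU} // 2;       # hypothetical tuning knob
my $max    = max_procs_for( $cpus, $factor );
print "would call set_max_procs($max)\n";
```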

The CPU-bound processes (those that have completed their downloads and are into the parsing phase) will be able to usefully use the time-slices that the IO-bound processes relinquish in recv wait states. For best effect, you should decrease (nice) the priority of the processes once they complete their downloads and move into the parsing phase. That will allow the IO-bound processes to respond quickly to the receipt of data, but will have little effect on the CPU-bound processes, as the IO-bound ones will move back into another wait state almost immediately.
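Inside each child, that might be as simple as the following sketch, which uses POSIX::nice to lower the process's own priority between the two phases. The download and parse_and_upload subs are hypothetical stand-ins, and the increment of 5 is an arbitrary starting point rather than a recommendation.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-ins for the real download and parse/upload steps.
sub download         { my ($url)  = @_; return "data for $url" }
sub parse_and_upload { my ($data) = @_; return length $data }

sub handle_url {
    my ($url) = @_;
    my $data = download($url);    # IO-bound phase at normal priority

    # Lower our own priority before the CPU-bound phase. An unprivileged
    # process may raise its nice value but cannot lower it again.
    defined POSIX::nice(5) or warn "nice failed: $!";

    return parse_and_upload($data);
}

print handle_url("http://example.com/file1"), "\n";
```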

A further option for tuning is to arrange the urls, or the picking of urls for newly forked children, so that you avoid downloading more than one file from the same host concurrently. That probably means keeping the urls for each host in separate arrays and only picking the next url for a host once the previous download from that host completes. That complicates the model somewhat as you would need to use 'nested parallel managers'.
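The bookkeeping for that could be sketched as below. It uses a plain regex to pull out the host (so urls with user:password@ parts are not handled), and the names host_of, group_by_host and next_url are invented for the illustration.

```perl
use strict;
use warnings;

# Extract the host part of a url; returns undef if it does not look like one.
sub host_of {
    my ($url) = @_;
    return $url =~ m{^\w+://([^/:]+)} ? lc $1 : undef;
}

# Build { host => [ urls... ] } from a flat list of urls.
sub group_by_host {
    my %by_host;
    for my $url (@_) {
        my $host = host_of($url);
        next unless defined $host;
        push @{ $by_host{$host} }, $url;
    }
    return \%by_host;
}

# Pick a url from some host that has no download in flight, and mark
# that host busy. Returns an empty list if nothing is eligible.
sub next_url {
    my ( $by_host, $busy ) = @_;
    for my $host ( sort keys %$by_host ) {
        next if $busy->{$host} or !@{ $by_host->{$host} };
        $busy->{$host} = 1;
        return ( $host, shift @{ $by_host->{$host} } );
    }
    return;
}

my $queues = group_by_host(
    qw( http://a.example/1 http://a.example/2 http://B.example/1 ) );
my %busy;
my @first  = next_url( $queues, \%busy );    # from a.example
my @second = next_url( $queues, \%busy );    # from b.example
my @third  = next_url( $queues, \%busy );    # empty: both hosts are busy
```

The caller would clear $busy{$host} when the download from that host finishes, which makes the remaining urls for that host eligible again.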

The original process would fork a subprocess to process the urls for a given host. That child would download the first file, then fork again. The grandchild process would then parse the downloaded data while its parent starts another download from the same host. Once the parent completes its download, it would wait until the grandchild completes before forking again and repeating the process. The grandparent process would only fork another child for that host's slot once that host has been completed. It's hard to describe, but not too difficult to program.
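As a rough sketch of the per-host part, using plain fork and waitpid: the download and parse_and_upload subs are hypothetical stand-ins, and the top-level loop here simply forks one child per host without any limit, where a real program would cap the number of hosts in flight.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real download and parse/upload steps.
sub download         { my ($url)  = @_; return "data for $url" }
sub parse_and_upload { my ($data) = @_; return length $data }

# Runs in the per-host child: downloads serially from one host, while a
# grandchild parses the previous file in parallel with the next download.
sub process_host {
    my (@urls) = @_;
    my $parser_pid;    # pid of the grandchild currently parsing, if any
    my $count = 0;
    for my $url (@urls) {
        my $data = download($url);    # overlaps with the previous parse
        waitpid( $parser_pid, 0 ) if $parser_pid;    # previous parse done?
        $parser_pid = fork();
        die "fork failed: $!" unless defined $parser_pid;
        if ( $parser_pid == 0 ) {     # grandchild: parse, upload and die
            parse_and_upload($data);
            exit 0;
        }
        $count++;
    }
    waitpid( $parser_pid, 0 ) if $parser_pid;
    return $count;
}

# Grandparent: one child per host (unlimited here, for brevity).
my %by_host = (
    'a.example' => [ 'http://a.example/1', 'http://a.example/2' ],
    'b.example' => ['http://b.example/1'],
);
my @kids;
for my $host ( sort keys %by_host ) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        process_host( @{ $by_host{$host} } );
        exit 0;
    }
    push @kids, $pid;
}
waitpid( $_, 0 ) for @kids;
```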

Anyway, a few ideas for you to mull over. Given the architecture of your setup, I see little benefit in using threads for this.



Re^4: Threading - getting better use of my MP box
by Joost (Canon) on Jul 06, 2007 at 18:00 UTC
    Anecdotal evidence: I'm currently using a multi-threaded perl/XS program that uses DBD::mysql 4.004 on several MP machines including an 8-core linux x86_64 box. In this program each thread (between 4 and 16 threads depending on configuration) creates and uses its own database handle.

    Although there are still some issues with shutting down the program cleanly, they don't appear to be database related. In any case, it'll run just fine for days without any issues. Note that we create the threads ASAP and then keep them during the whole time the program is running.

    The program doesn't really make heavy use of the database, though - most of the work is spent in fairly memory / cpu intensive calculations in the XS code, with the database acting as a dumb (single table) system to store the resulting data in.

    YMMV,

    Joost.