I can also thread the DBD section but I shouldn't share the same connection handle. Each thread should create / delete its own DBD connection.
I saw another poster suggest that it was 'safe to thread DBI provided each thread has its own handle'. I was reluctant to intercede as that poster may know better than I, but certainly at at least one point in time this was not the case.
Some or all DBI drivers (and/or the shared libraries that underlie them) have been shown to not be thread-safe on some or all platforms at some point in the past. The problem is, for at least some drivers/libraries, the process ID is used to key data to the connection, so using multiple handles from a single process causes things to get mixed up.
You will have to go to the DBI mailing list for the current skinny on which drivers, if any, are thread-safe, and on which platforms. You will also probably be advised that you "should not use DBI with or from threads". This seems to be the de facto standard position.
I will say that you can use DBI from a threaded program provided that you only use DBI from one thread of that program. Basically, if the program uses DBI from one (say the main) thread, then the presence of other threads that do not use DBI should not compromise anything as the DBI code will not even be aware that they are there.
In any case, as you are processing large numbers of mostly relatively small files, and running under *nix, there seems to be little or no advantage in using threads over processes for this. In this case I suggest using processes rather than threads.
Parallel::ForkManager seems almost ideal for the application. Have the main process load an array of the urls to be fetched and then fork a child to do a simple, linear read->parse->upload-to-DB-and-die. The parent process' role is simply to monitor the child processes and start another when one completes.
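Something along these lines is what I have in mind. Treat it as a sketch rather than finished code: run_in_children and fetch_parse_store are names I made up, the DSN, credentials and pages table are placeholders, and measuring the content length is only a stand-in for your real parsing. The point to notice is that each child opens and closes its own DBI handle, so no connection is ever shared.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Run $job->($url) in a separate child process for every url, keeping at
# most $max_procs children alive at once. Returns the number of children
# that exited non-zero. (run_in_children is a made-up name.)
sub run_in_children {
    my ($max_procs, $urls, $job) = @_;
    my $pm = Parallel::ForkManager->new($max_procs);

    my $failed = 0;
    $pm->run_on_finish(sub {
        my ($pid, $exit_code) = @_;
        $failed++ if $exit_code != 0;
    });

    for my $url (@$urls) {
        $pm->start and next;                 # parent: go on to the next url
        my $ok = eval { $job->($url); 1 };   # child: linear work, then die
        warn "$url: $@" unless $ok;
        $pm->finish($ok ? 0 : 1);            # the child exits here
    }
    $pm->wait_all_children;
    return $failed;
}

# One child's whole life: read -> parse -> upload. The DSN, credentials and
# table are placeholders. The handle is created and destroyed inside the
# child, so it is never shared between processes.
sub fetch_parse_store {
    my ($url) = @_;
    require LWP::UserAgent;
    require DBI;

    my $res = LWP::UserAgent->new(timeout => 30)->get($url);
    die $res->status_line . "\n" unless $res->is_success;
    my $size = length $res->decoded_content;    # stand-in for real parsing

    my $dbh = DBI->connect('dbi:mysql:database=test', 'user', 'password',
                           { RaiseError => 1 });
    $dbh->do('INSERT INTO pages (url, size) VALUES (?, ?)',
             undef, $url, $size);
    $dbh->disconnect;
}
```

The parent would then do no more than `my $failed = run_in_children(16, \@urls, \&fetch_parse_store);` and report how many children failed.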
Tuning is simply a case of varying the value passed to set_max_procs() at startup. On an 8-way box (assuming nothing else heavy is running concurrently on that box), you should run at least one process per CPU for maximum throughput. As each process will spend a large percentage of its time IO-bound doing the download, you should be able to achieve greater throughput by running double or even treble as many processes as you have CPUs.
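As a starting point for that tuning, the arithmetic could be wrapped up like this. Both sub names are hypothetical, the default factor of 2 is just the "double" mentioned above, and asking getconf for the CPU count is my assumption about your *nix box; hard-code the number if that does not work for you.

```perl
use strict;
use warnings;

# Ask the OS how many CPUs are online. Assumes a getconf that understands
# _NPROCESSORS_ONLN (Linux and most modern *nix); falls back to 1.
sub cpu_count {
    my $n = `getconf _NPROCESSORS_ONLN 2>/dev/null`;
    return ($n && $n =~ /^(\d+)/) ? $1 : 1;
}

# Processes to run for a given CPU count and oversubscription factor.
# A factor of 1 means one process per CPU; 2 or 3 suits IO-bound work.
sub max_procs_for {
    my ($cpus, $factor) = @_;
    $factor = 2 unless defined $factor;
    my $n = int($cpus * $factor);
    return $n < 1 ? 1 : $n;
}
```

With a Parallel::ForkManager object in $pm, you would then call `$pm->set_max_procs(max_procs_for(cpu_count(), 2));` and vary the factor between runs while watching throughput.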
The CPU-bound processes (those that have completed their downloads and are into the parsing phase) will be able to make use of the time-slices that the IO-bound processes relinquish while in recv wait states. For best effect, you should lower (nice) the priority of each process once it completes its download and moves into the parsing phase. That allows the IO-bound processes to respond quickly to the receipt of data, but has little effect on the CPU-bound processes, as the IO-bound ones will move back into another wait state almost immediately.
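Lowering the priority from inside the child is a one-liner with POSIX::nice. A minimal sketch follows; lower_priority is a made-up name and the default increment of 10 is an arbitrary choice, not a recommendation.

```perl
use strict;
use warnings;
use POSIX ();

# Make the current process "nicer" by $increment (default 10).
# POSIX::nice returns undef on failure, so report success as a boolean.
sub lower_priority {
    my ($increment) = @_;
    $increment = 10 unless defined $increment;
    return defined POSIX::nice($increment);
}

# Inside a child, the call sits between the two phases:
#   my $data = download($url);    # IO-bound, runs at normal priority
#   lower_priority();             # about to become CPU-bound
#   parse_and_upload($data);      # download/parse_and_upload are placeholders
```

As each child exits when it has finished, there is no need to restore the priority afterwards.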
A further option for tuning is to arrange the urls, or the picking of urls for newly forked children, so that you avoid downloading more than one file from the same host concurrently. That probably means keeping the urls for each host in separate arrays and only starting another url for a given host once the previous download from that host completes. That complicates the model somewhat, as you would need to use 'nested parallel managers'.
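The per-host bookkeeping might look something like this. It is only a sketch of the separate arrays, not of the nested managers: group_by_host and next_url are hypothetical names, and the regex is a rough host extractor (the URI module would do the job more carefully).

```perl
use strict;
use warnings;

# Split a list of urls into one array per host. Urls with no recognisable
# scheme://host part are skipped.
sub group_by_host {
    my (@urls) = @_;
    my %by_host;
    for my $url (@urls) {
        my ($host) =
            $url =~ m{^[a-z][a-z0-9+.-]*://(?:[^/@]*@)?([^/:?#]+)}i;
        next unless defined $host;
        push @{ $by_host{lc $host} }, $url;
    }
    return \%by_host;
}

# Pick the next url from any host that is not currently being downloaded
# from. $busy is a hash of host => true while a download is in flight.
# Returns (host, url), or an empty list if nothing is eligible.
sub next_url {
    my ($by_host, $busy) = @_;
    for my $host (sort keys %$by_host) {
        next if $busy->{$host};
        next unless @{ $by_host->{$host} };
        return ($host, shift @{ $by_host->{$host} });
    }
    return;
}
```

The parent would mark a host busy when it hands out one of its urls and clear the flag when that download is reported complete.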
The original process would fork a subprocess to process the urls for a given host. That child would download the first file, then fork again. The grand-child process would then parse the data downloaded while its parent starts another download from the same host. Once the parent completes its download it would wait until the child completes before forking again and repeating the process. The grandparent process would only fork another once that host has been completed. It's hard to describe, but not too difficult to program.
Anyway, a few ideas for you to mull over. Given the architecture of your setup, I see little benefit in using threads for this.
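To make that description concrete, here is roughly what the per-host subprocess would run, using nothing but plain fork and waitpid. The name process_host and the two callbacks are my inventions; $download and $parse stand in for your real fetch and parse-and-upload code.

```perl
use strict;
use warnings;

# Handle every url for one host. This process does the downloads; each
# parse runs in a freshly forked child, overlapping with the next download.
# Returns the number of parse children that were started.
sub process_host {
    my ($urls, $download, $parse) = @_;
    my $parser_pid;
    my $started = 0;

    for my $url (@$urls) {
        my $data = $download->($url);        # overlaps the previous parse

        # Wait for the previous parse to finish before forking again.
        waitpid($parser_pid, 0) if defined $parser_pid;

        $parser_pid = fork();
        die "fork failed: $!" unless defined $parser_pid;
        if ($parser_pid == 0) {
            # Child: parse (and upload), then exit without returning.
            my $ok = eval { $parse->($url, $data); 1 };
            exit($ok ? 0 : 1);
        }
        $started++;
    }

    waitpid($parser_pid, 0) if defined $parser_pid;   # reap the last parse
    return $started;
}
```

The original process would start one of these per host, for instance from inside a Parallel::ForkManager loop, so that only one download per host is ever in flight.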
In reply to Re^3: Threading - getting better use of my MP box
by BrowserUk
in thread Threading - getting better use of my MP box
by ethrbunny