I can also thread the DBD section but I shouldn't share the same connection handle. Each thread should create / delete its own DBD connection.

I saw another poster suggest that it was 'safe to thread DBI provided each thread has its own handle'. I was reluctant to intercede as that poster may know better than I, but certainly at at least one point in time this was not the case.

Some or all DBI drivers (and/or the shared libraries that underlie them) have been shown to not be thread-safe on some or all platforms at some point in the past. The problem is, for at least some drivers/libraries, the process ID is used to key data to the connection, so using multiple handles from a single process causes things to get mixed up.

You will have to go to the DBI mailing list for the current skinny on which drivers, if any, are thread-safe on which platforms. You will also probably be advised that you "should not use DBI with or from threads". This seems to be the de facto standard position.

I will say that you can use DBI from a threaded program provided that you only use DBI from one thread of that program. Basically, if the program uses DBI from one (say the main) thread, then the presence of other threads that do not use DBI should not compromise anything as the DBI code will not even be aware that they are there.

In any case, as you are processing large numbers of mostly relatively small files, and running under *nix, there seems to be little or no advantage in using threads over processes for this. In this case I suggest using processes rather than threads.

Parallel::ForkManager seems almost ideal for the application. Have the main process load an array of the urls to be fetched and then fork a child to do a simple, linear read->parse->upload-to-DB-and-die. The parent process' role is simply to monitor the child processes and start another when one completes.
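Something along these lines is what I have in mind. Treat it as a minimal sketch rather than finished code: process_url() is a hypothetical stand-in for your real read->parse->upload code, the urls are made up, and the limit of 4 is arbitrary. Note that any DB connection should be opened inside the child, not inherited from the parent.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical stand-in for the real work: fetch the url, parse it,
# connect to the DB *inside the child*, upload, disconnect.
sub process_url {
    my ($url) = @_;
    return length $url;
}

# Fork one child per url, never running more than $max_procs at once.
# Returns a hash ref mapping each url to its child's exit code.
sub run_all {
    my ($max_procs, @urls) = @_;
    my $pm = Parallel::ForkManager->new($max_procs);

    my %exit_for;
    # This callback runs in the parent each time a child is reaped.
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident) = @_;
        $exit_for{$ident} = $exit_code;
    });

    for my $url (@urls) {
        $pm->start($url) and next;    # parent: go on to the next url
        # Child from here on: do the work, then exit via finish().
        my $ok = eval { process_url($url); 1 };
        $pm->finish($ok ? 0 : 1);
    }
    $pm->wait_all_children;
    return \%exit_for;
}

my $status = run_all(4, map { "http://example.com/file$_" } 1 .. 10);
printf "%d of %d children exited cleanly\n",
    scalar(grep { $_ == 0 } values %$status), scalar keys %$status;
```

The parent does nothing but start children and collect exit codes, so a failed download only costs you that one url.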

Tuning the process is simply a case of varying the set_max_procs() at startup. With an 8-way box (assuming nothing else heavy is running concurrently on that box), you should run at least one process per CPU for maximum throughput. As each process will be IO-bound doing its download for a large percentage of its time, you should be able to achieve greater throughput by running double or even treble as many processes as you have CPUs.
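As a rough illustration of the numbers worth trying, here is a small sketch. Both helper names are my own invention, and counting "processor" lines in /proc/cpuinfo is a Linux-specific assumption; on other systems just pass the CPU count in by hand.

```perl
use strict;
use warnings;

# Count CPUs by reading /proc/cpuinfo (Linux-specific assumption).
# Falls back to 1 if the file is missing or has no processor lines.
sub cpu_count {
    open my $fh, '<', '/proc/cpuinfo' or return 1;
    my $n = grep { /^processor\s*:/ } <$fh>;
    close $fh;
    return $n || 1;
}

# Candidate process limits to benchmark: 1x, 2x and 3x the CPU count.
sub candidate_proc_counts {
    my ($cpus) = @_;
    return map { $cpus * $_ } 1 .. 3;
}

my $cpus = cpu_count();
print "CPUs: $cpus; limits worth trying: ",
    join(', ', candidate_proc_counts($cpus)), "\n";
```

Time a representative batch at each limit and keep whichever finishes soonest; the best figure depends on how long the downloads take relative to the parsing.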

The CPU-bound processes (those that have completed their downloads and are into the parsing phase) will be able to usefully use the time-slices that the IO-bound processes will relinquish in recv wait states. For best effect, you should decrease (nice) the priority of the processes once they complete their downloads and move into the parsing phase. That will allow the IO-bound processes to respond quickly to the receipt of data, but will have little effect on the CPU-bound processes, as the IO-bound ones will move back into another wait state almost immediately.
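A child can lower its own priority with the builtin getpriority/setpriority pair. This is only a sketch: lower_priority() is my own helper name, the increment of 5 is arbitrary, and the download and parse phases are left as comments. The pipe is there purely so the parent can show what the child ended up with.

```perl
use strict;
use warnings;

# Raise this process's nice value by $increment (capped at 19), which
# lowers its scheduling priority. Returns the resulting nice value.
sub lower_priority {
    my ($increment) = @_;
    my $target = getpriority(0, 0) + $increment;   # 0, 0 = this process
    $target = 19 if $target > 19;
    setpriority(0, 0, $target);
    return getpriority(0, 0);
}

pipe(my $reader, my $writer) or die "pipe failed: $!";
my $before = getpriority(0, 0);

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    close $reader;
    # ... the download phase would run here, at normal priority ...
    my $after = lower_priority(5);
    # ... the parse/upload phase would run here, at lowered priority ...
    print {$writer} "$after\n";
    close $writer;
    exit 0;
}

close $writer;
chomp(my $child_nice = <$reader>);
close $reader;
waitpid($pid, 0);
print "parent nice: $before; child nice while parsing: $child_nice\n";
```

Only the child's priority changes; the parent and any siblings still downloading carry on at the original value.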

A further option for tuning is to arrange the urls, or the picking of urls for newly forked children, so that you avoid downloading more than one file from the same host concurrently. That probably means keeping the urls for each host in separate arrays and only starting another download from a host once the previous download from that host completes. That complicates the model somewhat as you would need to use 'nested parallel managers'.
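The per-host bookkeeping on its own might look like this. It is a sketch under assumptions: group_by_host() and next_url() are hypothetical helpers, the host is pulled out with a crude regex that expects scheme://host/... urls (the URI module would be more robust), and the example urls are made up. No forking is shown here, only the grouping and the choice of the next url.

```perl
use strict;
use warnings;

# Group urls into one queue (array) per host.
sub group_by_host {
    my (@urls) = @_;
    my %queue_for;
    for my $url (@urls) {
        # Crude host extraction; urls that do not match are skipped.
        my ($host) = $url =~ m{^[a-z][a-z0-9+.-]*://([^/:?#]+)}i
            or next;
        push @{ $queue_for{ lc $host } }, $url;
    }
    return \%queue_for;
}

# Pick the next url from any host with no download in flight and mark
# that host busy. Returns (host, url), or an empty list if none is free.
sub next_url {
    my ($queue_for, $busy) = @_;
    for my $host (sort keys %$queue_for) {
        next if $busy->{$host};
        next unless @{ $queue_for->{$host} };
        $busy->{$host} = 1;
        return ($host, shift @{ $queue_for->{$host} });
    }
    return;
}

my $queue_for = group_by_host(qw(
    http://a.example/1
    http://a.example/2
    http://b.example/1
));
my %busy;
my @first  = next_url($queue_for, \%busy);   # taken from a.example
my @second = next_url($queue_for, \%busy);   # b.example, as a.example is busy
my @third  = next_url($queue_for, \%busy);   # empty: nothing is free
$busy{'a.example'} = 0;                      # a.example's download finished
my @fourth = next_url($queue_for, \%busy);   # a.example's second url
print "started: $first[1], $second[1], then $fourth[1]\n";
```

Whatever reaps the finished download has to clear that host's busy flag before the next pick, as the demo does by hand.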

The original process would fork a subprocess to process the urls for a given host. That child would download the first file, then fork again. The grand-child process would then parse the data downloaded while its parent starts another download from the same host. Once the parent completes its download it would wait until the child completes before forking again and repeating the process. The grandparent process would only fork another once that host has been completed. It's hard to describe, but not too difficult to program.
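In code, the nesting comes out roughly as below. This is a bare-bones sketch using plain fork/waitpid: download() and parse_and_upload() are hypothetical stubs, the hosts and urls are made up, and the top level forks one child per host with no limit on their number, which real code would want to cap (with Parallel::ForkManager, say).

```perl
use strict;
use warnings;

# Hypothetical stubs standing in for the real work.
sub download         { my ($url)  = @_; return "data for $url" }
sub parse_and_upload { my ($data) = @_; return 1 }

# Runs in the per-host child. Downloads from the host are strictly
# serial; each file is parsed in a grandchild while the next download
# runs. Returns the number of grandchildren that failed.
sub process_host {
    my (@urls) = @_;
    my ($parser_pid, $failures) = (undef, 0);
    for my $url (@urls) {
        my $data = download($url);
        # Wait for the previous parser before forking another.
        if (defined $parser_pid) {
            waitpid($parser_pid, 0);
            $failures++ if $?;
        }
        $parser_pid = fork();
        die "fork failed: $!" unless defined $parser_pid;
        if ($parser_pid == 0) {                 # grandchild: parse only
            exit(parse_and_upload($data) ? 0 : 1);
        }
    }
    if (defined $parser_pid) {                  # reap the last parser
        waitpid($parser_pid, 0);
        $failures++ if $?;
    }
    return $failures;
}

my %urls_for = (
    'a.example' => [ map { "http://a.example/$_" } 1 .. 3 ],
    'b.example' => [ map { "http://b.example/$_" } 1 .. 2 ],
);

# Grandparent: one child per host, then wait for all of them.
my %host_for_pid;
for my $host (sort keys %urls_for) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exit(process_host(@{ $urls_for{$host} }) ? 1 : 0);
    }
    $host_for_pid{$pid} = $host;
}

my %status_for;
while ((my $pid = wait()) > 0) {
    $status_for{ $host_for_pid{$pid} } = $? >> 8;
}
print "$_ exited with $status_for{$_}\n" for sort keys %status_for;
```

At most one download and one parse are ever running per host, which is the point of the exercise.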

Anyway, a few ideas for you to mull over. Given the architecture of your setup, I see little benefit in using threads for this.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^3: Threading - getting better use of my MP box by BrowserUk
in thread Threading - getting better use of my MP box by ethrbunny
