I'm doing a complex bit of data processing that goes like this:
* build a large data structure (~500 MB) by retrieve()-ing several data files,
* do some data mining to find all data points that match predefined filters and collect them in arrays,
* use IPC::Open2 to start an R process for statistics,
* feed the arrays (one at a time) and R commands to the R process and read back the results,
* dump the results into a LaTeX table.
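
For concreteness, here is a minimal sketch of the kind of IPC::Open2 round trip described in the steps above. It is not my actual code: `run_child` is a hypothetical helper, and the running perl binary (`$^X`) stands in for R so that the example runs anywhere. With R, the command list would name the R executable and its options instead.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: send input lines to a child process and
# return the lines it prints back.
sub run_child {
    my ($cmd, @input) = @_;
    my $pid = open2(my $from_child, my $to_child, @$cmd);
    print {$to_child} "$_\n" for @input;
    close $to_child;                 # EOF tells the child the input is complete
    chomp(my @output = <$from_child>);
    close $from_child;
    waitpid $pid, 0;                 # reap the child
    return @output;
}

# Stand-in for R: a child perl that doubles every number it reads.
my @doubled = run_child([$^X, '-ne', 'print 2 * $_, "\n"'], 1, 2, 3);
print "@doubled\n";
```

Writing everything before reading anything only works while the data fits in the pipe buffers. With large arrays the parent and child can block on each other, so a real version would have to interleave reads and writes.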

Now all of this takes quite a long time - especially the R commands themselves.

I was wondering how I could speed up this procedure when I realized that I have a dual-core machine, and one glance at the CPU usage graph told me that one of the CPUs just sits there uselessly while my beard is turning white. If I could have two R processes running at the same time and share the input between them, the processing could finish in about half the time! The only problem is that I'm not exactly sure how to do it.

I guess the obvious answer is "threads", more specifically, a dispatcher/worker model where there is one central thread that assembles the input for the workers and puts it on a queue, and there are several (at least two) worker threads, each munching data off the queue and feeding it to its own R process.
However, I'm not comfortable with threads and I don't even know if the above scheme is viable at all. For example, how can I avoid a race condition when I'm filling up the result data structure (for the table) if the results themselves come in asynchronously? How do I tell which thread is busy and which one is available?
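
To make the dispatcher/worker idea concrete, here is a rough sketch using `threads` and `Thread::Queue`. It assumes a perl built with thread support. `process_dataset` is a hypothetical stand-in that just averages the numbers; in the real program it would be the conversation with that worker's own R process. The dataset names and values are made up for illustration.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $n_workers = 2;                      # one worker per CPU core
my $jobs      = Thread::Queue->new;     # dispatcher -> workers
my $results   = Thread::Queue->new;     # workers -> dispatcher

# Hypothetical stand-in for "feed the array to R and read back the result".
sub process_dataset {
    my ($data) = @_;
    my $sum = 0;
    $sum += $_ for @$data;
    return $sum / @$data;
}

# Each worker pulls the next job as soon as it is free, so the
# dispatcher never has to track which thread is busy.
sub worker {
    while (defined(my $job = $jobs->dequeue)) {
        my ($id, $data) = @$job;
        $results->enqueue([$id, process_dataset($data)]);
    }
}

my @workers = map { threads->create(\&worker) } 1 .. $n_workers;

# Made-up example input, keyed by dataset name.
my %datasets = (a => [1, 2, 3], b => [10, 20], c => [5]);
$jobs->enqueue([$_, $datasets{$_}]) for sort keys %datasets;
$jobs->enqueue(undef) for @workers;     # one end marker per worker
$_->join for @workers;

# Only the main thread fills the result table, keyed by dataset id,
# so the order in which results arrive does not matter.
my %table;
while (defined(my $r = $results->dequeue_nb)) {
    $table{ $r->[0] } = $r->[1];
}
print "$_: $table{$_}\n" for sort keys %table;
```

If something like this is sound, both questions would be handled by the queues: free workers fetch work themselves, and the table is only ever written by one thread.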

And there is another thing: it would be best if the solution could be extended to include clients that run on other computers. That is, there would be a server that decides which datasets should be processed; it would then look around on the local network and send data to clients that have available slots. Are there by any chance ready-made solutions for problems like this?

In reply to Using threads to run multiple external processes at the same time by Anonymous Monk
