Re: ithreads weren't the way.. still searching
by BrowserUk (Patriarch) on Oct 01, 2004 at 04:22 UTC
...ithreads functionality doesn't even come close to what it would take for this to work.
Pardon me, but poppycock!
From the scant information supplied, you want to fetch a sequence of pages concurrently and then re-assemble them in the original order.
Off the top of my head, I'd do something like this:
- Create two Thread::Queues:
  - one to supply the "$seq_no:$url" jobs to the threads,
  - one to return the fetched pages as "$seq_no:$contents" to the main thread.
- Start a number of threads that:
  - create their own user agents,
  - loop over the inputQ, waiting for a "$seq_no:$url", terminating when they dequeue undef,
  - split the seq_no and url,
  - fetch the url,
  - prepend the sequence number to the contents and enqueue to the outputQ,
  - loop till undef.
- Main thread: enqueues the "$seq_no:$url" jobs to the inputQ.
- Main: waits for the inputQ to empty.
- Main: enqueues one undef per thread.
- Main: sorts the outputQ by the prepended seq_no into the correct order, splits off the sequence numbers, and joins the contents.
- Main: processes the output.
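For concreteness, a minimal sketch of that outline. LWP::UserAgent and the placeholder URLs are my assumptions; any user agent would do:

use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my @urls = map { "http://example.com/page$_" } 1 .. 6;   # placeholders

my $inputQ  = Thread::Queue->new;
my $outputQ = Thread::Queue->new;
my $WORKERS = 3;

my @workers = map {
    threads->new( sub {
        my $ua = LWP::UserAgent->new;     # each thread gets its own UA
        while ( defined( my $job = $inputQ->dequeue ) ) {
            my ( $seq_no, $url ) = split /:/, $job, 2;
            my $contents = $ua->get($url)->content;
            $outputQ->enqueue("$seq_no:$contents");
        }
    } );
} 1 .. $WORKERS;

$inputQ->enqueue("$_:$urls[$_]") for 0 .. $#urls;   # queue the jobs
$inputQ->enqueue(undef) for 1 .. $WORKERS;          # one undef per thread
$_->join for @workers;

my @returns;                              # drain, sort by seq_no, reassemble
push @returns, $outputQ->dequeue while $outputQ->pending;
my $document = join '',
    map  { ( split /:/, $_, 2 )[1] }
    sort { ( split /:/, $a, 2 )[0] <=> ( split /:/, $b, 2 )[0] } @returns;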
There is plenty of scope in there for overlapping the appending and processing with the fetching. The main thread can dequeue the returns, process those that come out in the right order and store out-of-sequence returns in a hash for easy lookup. Each time it completes processing one set of content, it looks first in the hash to see if the next in sequence is available. If not, it goes back to dequeuing until it gets it.
With a little more ingenuity, the main thread could start another thread to do the processing that waits on a third Q. The main thread then dequeues and either re-queues to the processing thread or buffers in a hash.
The processing thread then performs the final disposal of the processed accumulated data, whilst the main thread blocks waiting for it to finish.
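Continuing the names from the sketch above (and replacing its final drain-and-sort), the stash-and-reorder idea with a dedicated processing thread might look like this; dispose_of() is a hypothetical stand-in for the final handling:

my $processQ  = Thread::Queue->new;
my $processor = threads->new( sub {
    while ( defined( my $contents = $processQ->dequeue ) ) {
        dispose_of($contents);            # final disposal, in order
    }
} );

my %stash;                                # out-of-sequence returns
for my $want ( 0 .. $#urls ) {
    my $contents = delete $stash{$want};  # look in the stash first
    until ( defined $contents ) {         # otherwise keep dequeuing
        my ( $seq_no, $page ) = split /:/, $outputQ->dequeue, 2;
        if ( $seq_no == $want ) { $contents = $page }
        else                    { $stash{$seq_no} = $page }
    }
    $processQ->enqueue($contents);        # re-queue in the correct order
}
$processQ->enqueue(undef);
$processor->join;                         # main thread blocks till it's done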
It's actually a very good use of threads and very straightforward to code.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Your post was clarifying, and I admit I don't have much expertise with ithreads (though you seemed to miss the fact that the pages need to be fetched sequentially). The limitations I found that killed the idea, at first, were not being able to use shared blessed objects, or even classes, because unshared referents are trouble for ithreads. So how could a thread which does the processing call $root->push_content? Your post has given answers to those questions, which I'll study. I think a main thread holding $root and a queue of processed elements to be pushed would be best.
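Something along these hypothetical lines is what I have in mind - workers pass plain markup scalars through the queue, and only the main thread ever touches $root:

use strict;
use warnings;
use threads;
use Thread::Queue;
use HTML::TreeBuilder;                    # $root is an HTML::Element

my $doneQ = Thread::Queue->new;
my $root  = HTML::TreeBuilder->new_from_content(
    '<html><head></head><body></body></html>'
);

my $worker = threads->new( sub {
    # pretend processing; real code would build these from fetched pages
    $doneQ->enqueue("<p>processed element $_</p>") for 1 .. 3;
    $doneQ->enqueue(undef);
} );

my $body = $root->find('body');
while ( defined( my $fragment = $doneQ->dequeue ) ) {
    # the unshared referent never crosses a thread boundary
    $body->push_content(
        HTML::TreeBuilder->new_from_content($fragment)->guts
    );
}
$worker->join;
print $root->as_HTML, "\n";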
Thanks a lot.
Re: ithreads weren't the way.. still searching
by tachyon (Chancellor) on Oct 01, 2004 at 03:56 UTC
If, as you say, you _need_ to get the pages sequentially, then threading appears pointless, as the vast majority of the time will be taken getting the data, not processing it. It usually takes 2-4 seconds to get a page, and the quantity of processing you can do in that sort of time is huge. I would be extremely surprised if the processing time were more than a few % of the total runtime, with the rest simply spent waiting on the socket for data. You are in much more fruitful territory if you can avoid a sequential get and do at least part of the getting in parallel. Or, to put it another way: the bottleneck is almost certainly the getting, not the processing, and optimisation that does not affect the bottleneck is pretty much a waste of time.
I suggest you prove that there is any benefit to be gained (i.e. that processing takes a significant % of the runtime) before you waste your time producing the right solution to the wrong problem.
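For instance, a quick-and-dirty measurement along these lines will tell you where the time goes (fetch_page(), process_page() and @urls are hypothetical stand-ins for your real code):

use strict;
use warnings;
use Time::HiRes qw( time );

my ( $fetching, $processing ) = ( 0, 0 );
for my $url (@urls) {
    my $t0   = time;
    my $page = fetch_page($url);          # the network-bound part
    $fetching += time - $t0;

    $t0 = time;
    process_page($page);                  # the CPU-bound part
    $processing += time - $t0;
}
printf "fetching: %.2fs  processing: %.2fs (%.1f%% of total)\n",
    $fetching, $processing,
    100 * $processing / ( $fetching + $processing );

If processing turns out to be a few percent of the total, threads buy you almost nothing here.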
You're right, of course... I'm trying to make the best of this, but there's definitely not much to gain: profiling the code showed about 2 seconds of processing in 18 seconds of execution.
But it seemed like an interesting situation. The ithreads limitations I found *seemed* real (BrowserUk) =), and I wondered how to do what I set out to do at first, should I find myself in a similar position where real gain was at stake.
Thanks
Re: ithreads weren't the way.. still searching
by meredith (Friar) on Oct 01, 2004 at 03:39 UTC
Forgive me if I'm way off here, because I don't think I'm getting the whole picture. Somewhere between sleepness and gin, my mind is going...
So you have to get a bunch of pages, in order, and timely. Why not just get all the pages into memory or temporary files, then do the processing pass? It can be a real pain to find a good way to pass complex data between threads, and you'll probably be uncomfortable with whatever solution you end up with. (Some monks will come along and chastise you for using an unstable feature - fork() em, I say!) You might use Storable or YAML to flatten the data into a scalar; once it's a scalar, you can use Thread::Queue to get it back to the parent, as in the sketch below.
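A minimal sketch of the Storable-over-a-queue idea (the hash is a hypothetical stand-in for whatever complex structure you build):

use strict;
use warnings;
use threads;
use Thread::Queue;
use Storable qw( freeze thaw );

my $q = Thread::Queue->new;

my $worker = threads->new( sub {
    my %result = ( url => 'http://example.com', title => 'Example' );
    $q->enqueue( freeze( \%result ) );    # complex data, now a single scalar
} );
$worker->join;

my $result = thaw( $q->dequeue );         # rebuilt on the parent's side
print "$result->{title}\n";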
Update: sleeplessness!
mhoward - at - hattmoward.org
>So you have to get a bunch of pages, in order, and timely.
Huh, no. They're not in order: I never know what the next page is; it comes up as a link in the fetched page. As I said, sequential fetching is a must.
>Why not just get all the pages to memory or temporary files, then do the processing pass?
Well, that would turn `f p f p f p' into `f f f p p p'. Not sure if that'd help much, although it could; but it's not my main point, which is doing both things at once.
>fork() em, I say!
Not sure how... seems like a perfect thread situation to me, although clearly not an ithread situation.
Thanks
What's up with your node? You can use HTML for the most part, except use <code> to wrap code, so it can be formatted and/or extracted correctly.
It shouldn't be hard to keep the order that you walk the pages in, but we'll go with the threads here. You have a few problems to deal with:
- How do you get the processed data back, in order, to the parent?
- How do you know, after all processing threads have started (and some may have already finished), when all are done and ready for the next step?
- How do you handle errors in a processing thread?
Among others.
You could try an assembly-line thread pattern. Imagine thus:
use strict;
use warnings;
use threads;
use Thread::Queue;

my ( $start, $end ) = ( 1, 10 );             # page numbers to walk

my $work_queue      = Thread::Queue->new;
my $fetched_queue   = Thread::Queue->new;
my $processed_queue = Thread::Queue->new;

$work_queue->enqueue($_) for $start .. $end; # fill our work queue
$work_queue->enqueue(undef);                 # end-of-work marker

my $fetch_thread   = threads->new( \&fetch,   $work_queue,    $fetched_queue );
my $process_thread = threads->new( \&process, $fetched_queue, $processed_queue );

sub fetch {
    my ( $input_queue, $output_queue ) = @_;
    while ( defined( my $fetch_this = $input_queue->dequeue ) ) {
        my $content = "page $fetch_this";    # get content, put in scalar
        $output_queue->enqueue($content);
    }
    $output_queue->enqueue(undef);           # pass the marker downstream
}

sub process {
    my ( $input_queue, $output_queue ) = @_;
    while ( defined( my $process_this = $input_queue->dequeue ) ) {
        my $content = "did: $process_this";  # process data, put in scalar
        $output_queue->enqueue($content);
    }
    $output_queue->enqueue(undef);
}

while ( defined( my $processed_data = $processed_queue->dequeue ) ) {
    # assemble into final output
}
$_->join for $fetch_thread, $process_thread;
# make final output
I know this will need some adjustment to get exactly what you want, but you get the idea, right? (The code above is still a sketch, missing much; it may not even be sane - read the sleep and gin disclaimer above.) I think a key part is passing the $work_queue into the fetch thread, so it can add newly discovered links to its own input queue.
Update: Also, "fork() em" is a play on "f__k em". That is to say: ignore the warnings, and continue on your quest, noble monk!
mhoward - at - hattmoward.org
Re: ithreads weren't the way.. still searching
by pg (Canon) on Oct 01, 2004 at 05:03 UTC
Why threads? Speed? No!
Threading does not give you much in this case: on the one hand, it does not give you speed; on the other, it sucks up resources.
Take a look at IO::Select.
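A minimal sketch of the idea, assuming plain HTTP/1.0 GETs against placeholder hosts (error handling omitted):

use strict;
use warnings;
use IO::Select;
use IO::Socket::INET;

my @hosts = qw( www.example.com www.example.net );   # placeholders

my $sel = IO::Select->new;
my %page;                                 # per-socket response buffers
for my $host (@hosts) {
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
    ) or next;
    print $sock "GET / HTTP/1.0\r\nHost: $host\r\n\r\n";
    $sel->add($sock);
    $page{$sock} = '';
}

while ( $sel->count ) {                   # one process, many sockets
    for my $sock ( $sel->can_read ) {
        if ( sysread $sock, my $chunk, 4096 ) {
            $page{$sock} .= $chunk;       # data ready on this socket
        }
        else {
            $sel->remove($sock);          # EOF: this page is complete
            close $sock;
        }
    }
}

All the waiting happens in one select() loop instead of one thread per connection.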