in reply to passing hashes between threads
Warning: This is a long and convoluted response and the real answer comes at the bottom. But please read it all, because otherwise you will not follow the logic by which I arrive at the final response, and will probably start arguing for things I've already covered.
did i get it right that enqueuing a hash always does a deep copy?
Yes.
why is this deep copy needed? If the worker thread delivers the hash it is not used any more by it.
Typically, the feeder end of the queue will be populating a hash; enqueuing it; then populating it with new data and enqueuing that.
If a copy was not made, then by the time the reader got hold of the hash (or rather, a reference to it), the feeder would already have overwritten it with the next record. Or worse, partially overwritten it.
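A minimal sketch of that feeder pattern, assuming a threaded perl and a made-up record layout (the `id` and `label` keys are purely illustrative). One hash is reused for every record, and each record still comes off the queue intact because enqueue copied it. It runs in a single thread to keep it short; `threads` is loaded before Thread::Queue so that sharing, and therefore the copy, is active.

```perl
use strict;
use warnings;
use threads;          # loaded before Thread::Queue so sharing is active
use Thread::Queue;

my $q = Thread::Queue->new;

# Feeder side: one hash, repopulated and enqueued for each record.
my %rec;
for my $id (1 .. 3) {
    %rec = (id => $id, label => "record $id");   # illustrative fields
    $q->enqueue(\%rec);                          # a deep copy is queued
}

# Reader side: each dequeued item is its own copy, not a reference to %rec.
my @seen;
while (defined(my $r = $q->dequeue_nb)) {
    push @seen, "$r->{id}:$r->{label}";
}

print "$_\n" for @seen;
```

Had the queue held references to the one `%rec`, all three items would show whatever the feeder wrote last.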
is there a better way of passing hashes between threads than thread::queue?
You could have a shared array of shared hashes. The feeder populates one of the hashes in the array, and then queues its index in the shared array to the other thread.
The reader then dequeues the index and knows which of the hashes it should process.
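Roughly, that could look like the sketch below. The pool size, the `value` field and the summing are assumptions for illustration only, and it sidesteps slot reuse by giving every record its own slot.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# A shared array of shared hashes (4 slots is an arbitrary choice).
my @slots :shared;
$slots[$_] = &share({}) for 0 .. 3;

my $q = Thread::Queue->new;

# Reader: dequeues an index, processes that slot, then empties it.
my $reader = threads->create(sub {
    my $total = 0;
    while (defined(my $i = $q->dequeue)) {
        my $rec = $slots[$i];
        $total += $rec->{value};
        %$rec = ();               # empty the slot once processed
    }
    return $total;
});

# Feeder: populates a slot in place and queues only its index.
for my $i (0 .. $#slots) {
    $slots[$i]{value} = ($i + 1) * 10;
    $q->enqueue($i);
}
$q->enqueue(undef);               # undef tells the reader to stop

my $total = $reader->join;
my $left  = grep { scalar keys %$_ } @slots;
print "total=$total, non-empty slots=$left\n";
```

Only a small integer crosses the queue; the hash data itself is written once, directly into shared memory.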
Of course, once the reader has processed a particular hash, you will want to empty it or remove it from the shared array to prevent memory growing continuously. Ideally, you might have the feeder push the new hash on one end of the shared array, and the reader shift it off the other. You will no doubt recognise this as a queue, similar to your use of Thread::Queue.
The only distinction is that instead of pushing unshared hashes on one end, which then have to be copied into shared memory and then copied again into an unshared hash in the reader, both ends deal directly with shared hashes and so save the copying.
You could of course do exactly the same via Thread::Queue.
is there a way to "convince" thread::queue to accept hashes without deep copy?
Yes. Push references to already shared hashes.
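A sketch of what that might look like, assuming a threaded perl; the `id`, `square` and `seen` keys are invented for the demonstration. The reader marks each hash it receives, and the feeder can see those marks afterwards through the references it kept, which it could not do if the queue had handed over copies.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $q = Thread::Queue->new;

# Reader: marks each hash it receives.
my $reader = threads->create(sub {
    my $count = 0;
    while (defined(my $rec = $q->dequeue)) {
        $rec->{seen} = 1;         # written into the shared hash itself
        $count++;
    }
    return $count;
});

# Feeder: builds each record directly in a shared hash and queues the ref.
my @sent;
for my $id (1 .. 3) {
    my $rec = &share({});         # a new, empty shared hash
    $rec->{id}     = $id;
    $rec->{square} = $id * $id;
    push @sent, $rec;             # kept only so we can check it later
    $q->enqueue($rec);            # already shared, so nothing to clone
}
$q->enqueue(undef);               # undef tells the reader to stop

my $count  = $reader->join;
my $marked = grep { $_->{seen} } @sent;
print "reader got $count, feeder sees $marked marked\n";
```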
any comments about my approach?
Yes. Why do you want to queue hashes from one thread to the other in the first place?
Going back to your original application rather than your wholly artificial test code, paraphrased your description is:
Maybe this is too obvious, but why not just: queue the string you read; and convert it to a hash at the reader?
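Something along these lines, where the `key=value;key=value` line format and the `parse_line` helper are stand-ins I made up, since the real record format hasn't been posted:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical parser: turns "id=1;name=alpha" into a hash ref.
sub parse_line {
    my ($line) = @_;
    my %rec;
    for my $field (split /;/, $line) {
        my ($k, $v) = split /=/, $field, 2;
        $rec{$k} = $v;
    }
    return \%rec;
}

my $q = Thread::Queue->new;

# Reader: receives plain strings and builds the hash locally.
my $reader = threads->create(sub {
    my @names;
    while (defined(my $line = $q->dequeue)) {
        my $rec = parse_line($line);   # hash never leaves this thread
        push @names, $rec->{name};
    }
    return join ',', @names;
});

# Feeder: queues the raw lines; strings need no deep copy.
$q->enqueue($_) for 'id=1;name=alpha', 'id=2;name=beta';
$q->enqueue(undef);                    # undef tells the reader to stop

my $names = $reader->join;
print "$names\n";
```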
The mistake you are making is right up front in your logic, which you describe as this:
- set up a couple of worker threads, which parse the line and deliver back one hash per record in an output-thread::queue
- read input file and put lines in an input-thread:queue
- main process dequeues from output-queue and produces formatted output file
You say that most of the time is spent parsing -- I'd like to see evidence of that as it is very unusual for parsing to take longer than reading; but I'll accept you at your word -- so you have one thread reading from the file and you fan the input out to multiple workers to do the parsing. So far, so good.
But then you queue the hashes they build back to a single thread for further processing. Why?
You also say in your preamble: "I have to parse large text files for information and produce output in a different format. One line - one data record." That's not quite definitive, but it strongly suggests that you are writing one line of output for each line of input.
It makes sense that you need to bring the flows back together to write the output file -- it avoids the problems of having multiple writers to a single output file -- but what you will be writing to the output file will be strings, not hashes!
So why ship hashes to the writer thread? Why not perform whatever processing is required to produce the output strings from the parsed hashes in the threads that created those hashes and then queue the resultant strings to the writer in the form required for output?
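As a rough sketch of that arrangement: the line format, the two helper subs, the tab-separated output and the worker count are all assumptions on my part, not your real processing. Note that with more than one worker, the output order is not guaranteed to match the input order.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $WORKERS = 2;                       # arbitrary for the sketch

# Hypothetical parser: "id=1;name=alpha" -> hash ref.
sub parse_line {
    my ($line) = @_;
    my %rec;
    for my $field (split /;/, $line) {
        my ($k, $v) = split /=/, $field, 2;
        $rec{$k} = $v;
    }
    return \%rec;
}

# Hypothetical formatter: hash ref -> one output line.
sub format_record {
    my ($rec) = @_;
    return join "\t", $rec->{id}, $rec->{name};
}

my $in  = Thread::Queue->new;          # raw lines to the workers
my $out = Thread::Queue->new;          # finished strings to the writer

# Workers parse AND format, so only strings are ever queued.
my @workers = map {
    threads->create(sub {
        while (defined(my $line = $in->dequeue)) {
            $out->enqueue(format_record(parse_line($line)));
        }
        $out->enqueue(undef);          # this worker is finished
    });
} 1 .. $WORKERS;

# Main thread feeds the input, then one stop marker per worker.
$in->enqueue($_) for 'id=1;name=alpha', 'id=2;name=beta', 'id=3;name=gamma';
$in->enqueue(undef) for 1 .. $WORKERS;

# Main thread is also the single writer; stop once every worker is done.
my $done = 0;
my @output;
while ($done < $WORKERS) {
    my $s = $out->dequeue;
    if (defined $s) { push @output, $s } else { $done++ }
}
$_->join for @workers;

print "$_\n" for @output;
```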
Finally, please forget all the timings you have done in your wholly artificial (and I must say, way overcomplicated) benchmark scripts, because you are measuring the wrong things entirely.
If you have a real application for this, post the code of a single-threaded script that performs all of the required processing, along with a few (say 10 or so) typical records of input data. Once we can see (and time) the real processing involved, it might be possible to suggest ways of using threading to reduce the time taken to do it. Or possibly, to suggest that threading is the wrong solution to the problem.