in reply to passing hashes between threads

Warning: This is a long and convoluted response, and the real answer comes at the bottom. But please read it all; otherwise you will not follow the logic by which I arrive at the final response, and will probably start arguing for things I've already covered.

  1. Did I get it right that enqueuing a hash always does a deep copy?

    Yes.

  2. Why is this deep copy needed? Once the worker thread has delivered the hash, it no longer uses it.

    Typically, the feeder end of the queue will be populating a hash; enqueuing it; then populating it with new data and enqueuing that.

    If a copy were not made, then by the time the reader got hold of (what? a reference to) the hash, the feeder would already have overwritten it with the next record. Or worse, partially overwritten it.
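    A minimal sketch of that copy semantics (assuming a threads-enabled perl and Thread::Queue): the feeder overwrites its hash immediately after enqueuing, but the reader still sees the original values, because the queue deep-copied them into shared space.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new;

my %rec = ( id => 1, name => 'first' );
$q->enqueue( \%rec );                     # deep copy into shared space
%rec = ( id => 2, name => 'second' );     # feeder reuses the hash

my $got = $q->dequeue;                    # the copy, not the live hash
print "$got->{id} $got->{name}\n";        # prints "1 first"
```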

  3. Is there a better way of passing hashes between threads than Thread::Queue?

    You could have a shared array of shared hashes. The feeder populates one of the hashes in the array, and then queues its index in the shared array to the other thread.

    The reader then dequeues the index and knows which of the hashes it should process.

    Of course, once the reader has processed a particular hash, you will want to empty it or remove it from the shared array to prevent memory from growing continuously. Ideally, you might have the feeder push the new hash onto one end of the shared array and the reader shift it off the other. You will no doubt recognise this as a queue, similar to your use of Thread::Queue.

    The only distinction is that instead of pushing unshared hashes onto one end (which then have to be copied into shared memory) and then copying each shared hash into an unshared hash in the reader, both ends deal directly with shared hashes and so save the copying.

    You could of course do exactly the same via Thread::Queue.
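    As a sketch of that arrangement (a ring of pre-shared hashes, with only indices travelling through the queue; feeder and reader are shown in one thread for brevity):

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $q = Thread::Queue->new;

# A ring of pre-shared hashes; the feeder fills a slot, then queues its index.
my @ring : shared;
$ring[$_] = &share( {} ) for 0 .. 3;

# Feeder: populate a slot and queue the index, not the hash.
for my $i ( 0 .. 3 ) {
    %{ $ring[$i] } = ( seq => $i, data => "record $i" );
    $q->enqueue( $i );
}
$q->enqueue( undef );                     # end-of-work marker

# Reader: dequeue the index and work on the shared hash directly.
while ( defined( my $i = $q->dequeue ) ) {
    my $h = $ring[$i];                    # no copying
    print "$h->{seq}: $h->{data}\n";
    %$h = ();                             # empty the slot for reuse
}
```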

  4. Is there a way to "convince" Thread::Queue to accept hashes without a deep copy?

    Yes. Push references to already shared hashes.
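    A sketch of what that looks like (the hash is created in shared space up front, so only the reference travels through the queue; already-shared data is used as-is rather than cloned):

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $q = Thread::Queue->new;

my $h = &share( {} );            # the hash lives in shared space from the start
%$h = ( id => 42, text => 'hello' );

$q->enqueue( $h );               # no deep copy of an already-shared hash

my $got = $q->dequeue;
$got->{extra} = 'x';             # writing via one reference ...
print "$h->{extra}\n";           # ... is visible via the other: same hash
```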

  5. any comments about my approach?

    Yes. Why do you want to queue hashes from one thread to the other in the first place?

    Going back to your original application rather than your wholly artificial test code, your description, paraphrased, is:

    1. Read a string.
    2. Convert the string to a thread-local hash.
    3. Then
      • Either: copy local hash to a shared hash;
      • Or: convert the local hash back to a string;
    4. Queue the shared hash or string;
    5. Dequeue either: the shared hash and assign it to a thread-local unshared hash; or dequeue the string and convert it back to a thread-local hash.

    Maybe this is too obvious, but why not just queue the string you read and convert it to a hash at the reader?
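    That is (a sketch with a hypothetical parse_line; the worker receives plain strings and builds its hash locally, so nothing but strings ever crosses the queue):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $inQ = Thread::Queue->new;

# Hypothetical parser: "key=value;key=value" lines into key/value pairs.
sub parse_line {
    my ($line) = @_;
    return map { split /=/, $_, 2 } split /;/, $line;
}

my $worker = threads->create( sub {
    my $count = 0;
    while ( defined( my $line = $inQ->dequeue ) ) {
        my %rec = parse_line( $line );    # thread-local hash; never queued
        ++$count;
    }
    return $count;
} );

$inQ->enqueue( "id=1;name=alpha", "id=2;name=beta" );
$inQ->enqueue( undef );                   # end-of-work marker

my $n = $worker->join;
print "parsed $n records\n";              # parsed 2 records
```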

But my real comment is this.

The mistake you are making is right up front in your logic, which you describe as this:

  1. set up a couple of worker threads, which parse the lines and deliver back one hash per record in an output Thread::Queue
  2. read the input file and put its lines in an input Thread::Queue
  3. the main process dequeues from the output queue and produces a formatted output file

You say that most of the time is spent parsing -- I'd like to see evidence of that, as it is very unusual for parsing to take longer than reading, but I'll take you at your word -- so you have one thread reading from the file, and you fan the input out to multiple workers to do the parsing. So far, so good.

But then you queue the hashes they build back to a single thread for further processing. Why?

You also say in your preamble: I have to parse large text files for information and produce output in a different format. One line -- one data record. That's not quite definitive, but it strongly suggests that you are writing one line of output for each line of input.

It makes sense that you need to bring the flows back together to write the output file -- it avoids the problems of having multiple writers to a single output file -- but what you will be writing to the output file will be strings, not hashes!

So why ship hashes to the writer thread? Why not perform whatever processing is required to produce the output strings from the parsed hashes in the threads that created those hashes and then queue the resultant strings to the writer in the form required for output?
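Sketched out (with a hypothetical format_record standing in for whatever processing is required), the worker formats each record itself and queues only the finished output strings; the single writer never sees a hash:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $workQ = Thread::Queue->new;
my $outQ  = Thread::Queue->new;

# Hypothetical formatter: reverse the fields of a CSV line.
sub format_record {
    my ($line) = @_;
    return join ',', reverse split /,/, $line;
}

my $worker = threads->create( sub {
    while ( defined( my $line = $workQ->dequeue ) ) {
        $outQ->enqueue( format_record( $line ) );   # strings, not hashes
    }
    $outQ->enqueue( undef );                        # tell the writer we're done
} );

$workQ->enqueue( "a,b,c", "1,2,3" );
$workQ->enqueue( undef );

# Writer: the single point of output; it only ever handles strings.
my @out;
while ( defined( my $line = $outQ->dequeue ) ) {
    push @out, $line;
    print "$line\n";
}
$worker->join;
```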

Finally, please forget all the timings you have done in your wholly artificial (and I must say, way overcomplicated) benchmark scripts, because you are measuring the wrong things entirely.

If you have a real application for this, post the code of a single threaded script that performs all of the required processing, along with a few (say 10 or so typical) records of input data. Once we can see (and time) the real processing involved, it might be possible to suggest ways of using threading to reduce the time taken to do it. Or possibly, to suggest that threading is the wrong solution to the problem.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: passing hashes between threads
by bago (Scribe) on Sep 18, 2011 at 11:56 UTC

    Thank you for your comprehensive and helpful answers!

    I will need to think over it - but some quick remarks:

    First some comments on the problem as a whole.
    The input contains different record types because there is transaction data and master data. Most of the work is finding the right data in a transactional record, doing some (static) recoding, and cross-referencing via mapping files. This I do in ParseDok -- so "parse" is a little shorthand :-)
    But some of the data (the minor part) depends on previous records. That means that in record X a numbering change is announced and has to be applied to all following records.
    So the writer thread is not only writing but also maintaining the original order, and doing some filtering and code mapping as well. Sorry -- I tried to keep my post short.

    Push references to already shared hashes.
    I tried this - it was slower than the deep copy

    Why do you want to queue hashes from one thread to the other in the first place
    This was how I did it in the single-threaded version. So my first try was to put the parsing into worker threads and pass back the existing hashes. Now I am working on a new solution; hence my questions.

    Going back to your original application rather than your wholly artificial test code
    I did not think it artificial, because it is more or less the isolated code fragment of my thread handling. It is my test/experimental code for trying out new solutions. Sub ParseDok alone is ~1,100 lines of code (sure, not in one function!). I was interested in measuring time differences for passing data between threads, to get a feel for that.

      Push references to already shared hashes. -- I tried this - it was slower than the deep copy

      Hm. Here's my test of several methods of passing hashes between threads:

      And here are the results for 10000 hashes of 100 key/value pairs:

      c:\test>TQ-b -H=200 -N=10000
      Unshared hashrefs: 10.857
             join/split:  3.121
            freeze/thaw:  0.686
        Shared hashrefs:  0.265

      Here are the results for 10000 hashes of 1000 key/value pairs:

      c:\test>TQ-b -H=2000 -N=10000
      Unshared hashrefs: 117.532
             join/split:  30.482
            freeze/thaw:   2.886
        Shared hashrefs:   0.250

      Please note not just how much faster the latter is, but that it barely changes with hashes ten times the size.

      I still think that your design that requires hashes to be shipped from one thread to another is the wrong approach, but you've not supplied enough information to allow me to confirm or deny that.

      not only a writing but maintaining the original order

      Hm. This is very troublesome. Quite how you are "maintaining order" when fanning records out to multiple threads and then gathering them back together is very unclear. Nothing in your posted code, and no mechanism I am aware of, will allow you to do this.

      Threads are non-deterministic. Believing you will read records back from the 'return' queue in the same order as you fed them to the 'work' queue is a very bad assumption.


Re^2: passing hashes between threads
by bago (Scribe) on Sep 18, 2011 at 12:33 UTC

    If a copy were not made, then by the time the reader got hold of (what? a reference to) the hash, the feeder would already have overwritten it with the next record. Or worse, partially overwritten it.
    I understand that. But if I do something like:
    sub WorkerThread {
        while ( $element = $wq->dequeue ) {
            $hashref = DoWork( $element );
            $qq->enqueue( $hashref );
        }
    }

    sub DoWork {
        my %hash;
        ...
        return \%hash;
    }

    Why is the my %hash not doing the trick? It should be a new hash reference every time DoWork() is called. Am I missing something?

      Okay. I'll put it another way. That is the way iThreads' explicitly-shared-only data model works.

      In order that the programmer needn't be concerned with locking, variables declared in one thread cannot be seen by, or passed directly to, other threads in the process. When you give a reference to a non-shared variable to Thread::Queue (or any other mechanism that will convey unshared data between threads), it has to make a copy of the unshared data into the shared data-space.

      And when you read a reference out of the queue and dereference it into a non-shared variable in the other thread, the copy process -- this time from the shared data-space to the thread-local data-space -- has to happen again.

      The way to avoid the copying is to declare the data you wish to share in the shared space up front. You can then access it from any thread without copying it, though you then have to concern yourself with locking.
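      A minimal sketch of that: the hash is declared shared up front, both threads access it directly without any copying, and each access is guarded by lock.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my %data : shared;               # declared in shared space up front

my $t = threads->create( sub {
    lock( %data );               # locking is now our responsibility
    $data{count} = ( $data{count} || 0 ) + 1;
} );
$t->join;

{
    lock( %data );
    print "count = $data{count}\n";   # count = 1
}
```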

