Re^3: Threaded Code Not Faster Than Non-Threaded -- Why?

by BrowserUk (Patriarch)
on Jan 05, 2014 at 09:32 UTC


in reply to Re^2: Threaded Code Not Faster Than Non-Threaded -- Why?
in thread Threaded Code Not Faster Than Non-Threaded -- Why?

Incidentally, the basis of my code was straight from the documentation for threads and threads::shared. Who knew I would be so off base?

Hm. Unless there is some new or modified documentation kicking around that I've not seen, I think you have misinterpreted the documentation. I am not aware of any docs that suggest the queue architecture you are using.

Namely: each worker has its own work-item queue, and there is a shared work-request queue. In order to get a work item, a worker has to enqueue a request (its thread id) onto the request queue; wait (many time-expensive context switches) for the thread at the other end of that queue to dequeue that tid and enqueue a work item; and then continue waiting until the requesting worker gets another time-slice (many more context switches) before it can actually do some work.

The problems with this -- as I tried to analogise above -- are: a) you have many worker threads all waiting on one non-worker queue; b) one non-worker queue having to satisfy the demands of many worker threads.
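For concreteness, here is a condensed sketch of that request/response pattern -- not Tommy's actual code; next_item() and the "process $item" payload are placeholders -- just to make the round-trip visible: every single work item costs two queue operations and at least two context switches before any work gets done.

    use threads;
    use Thread::Queue;

    my $N        = 4;
    my $requests = Thread::Queue->new;   # shared "send me work" queue
    my %work_for;                        # one item queue per worker, keyed by tid

    sub worker {
        my ($my_q) = @_;
        while (1) {
            $requests->enqueue( threads->tid );  # ask for a work item...
            my $item = $my_q->dequeue;           # ...then block until it arrives
            last unless defined $item;
            # ... process $item ...
        }
    }

    for (1 .. $N) {
        my $q = Thread::Queue->new;
        $work_for{ threads->create( \&worker, $q )->tid } = $q;
    }

    # producer: block for a request, then reply to that specific worker
    while ( defined( my $item = next_item() ) ) {    # next_item(): hypothetical
        $work_for{ $requests->dequeue }->enqueue($item);
    }
    $work_for{ $requests->dequeue }->enqueue(undef) for 1 .. $N;  # shut down
    $_->join for threads->list;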

Are you saying that the "guard" should start polling a single queue and stuffing things into it on demand?

No. One of the purposes of queues is to avoid polling. (IMO there should not be a dequeue_nb() method for a queue; but that's a different discussion :)

The non-worker thread's job is to keep enough work in the (single) work-item queue to ensure that whenever a worker needs a work item, there is one immediately available. It doesn't have to ask for one; it just grabs one.

Its only other responsibility is to ensure that it doesn't overfill the queue; the only 'damage' overfilling does is to use more memory than necessary. Basically, whenever it (the non-worker, producer thread) has a new work item available, it checks the size of the work-item queue, and if that is bigger than some threshold value (2 or 3 times the number of workers is a good starting point), it simply sleeps for a short period to allow the worker threads to drain the queue a little.

Although this may sound like polling, it is quite different, in that there is no single event or synchronisation point involved. That is, it is not waiting for the queue length to fall to some exact number, but rather just until it falls below some number. Provided that number is sufficient to ensure that no worker is ever blocked waiting for a new item, this ensures that the workers run at their full potential speed, without risking memory overruns from overstuffing the queue.
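In code, the throttle amounts to little more than this (a minimal sketch: next_item() is a hypothetical item source, and the 3-items-per-worker threshold and ~10ms pause are just the starting points suggested above):

    use Thread::Queue;
    use Time::HiRes 'usleep';

    my $WORKERS = 8;
    my $LIMIT   = 3 * $WORKERS;        # allow ~3 queued items per worker
    my $q       = Thread::Queue->new;  # the single, shared work-item queue

    while ( defined( my $item = next_item() ) ) {   # next_item(): hypothetical
        # no exact target; just back off while the queue is over-full
        usleep 10_000 while $q->pending > $LIMIT;   # ~one 10ms timeslice
        $q->enqueue($item);
    }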

What to do... Could you steer me in the way of an implementation like the one you suggest?

The first thing is that all your workers should be reading from a single queue. After all they are all identical, so it doesn't matter which worker processes which work item.

That does away with the separate, per-worker item queues and the requests (tid) queue. The non-worker just stuffs each item onto the common work queue as it gets it; and only pauses briefly -- a single timeslice period, ~10ms, is a good starting point -- iff the queue is starting to overfill. If you get the length of the queue right, this will be a rare event.

This way each thread can -- assuming the nature of the work allows it -- process as many work items as it can in each timeslice it gets. Thus you maximise timeslice utilisation and minimise the expensive context switches. This architecture also ensures that when some workers run more slowly -- processing bigger files -- the other workers can continue to run flat out if they get a bunch of small ones. I.e. the single-queue/multiple-workers scheme becomes self-load-balancing.
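Putting the pieces together, the whole shape is roughly this (a sketch: process_item() stands in for whatever a worker does to one file, and @work_items for wherever the items come from; the throttle shown above would replace the plain enqueue loop):

    use threads;
    use Thread::Queue;

    my $WORKERS = 8;
    my $q       = Thread::Queue->new;      # the one shared work-item queue

    my @pool = map {
        threads->create( sub {
            # no request round-trip, no per-worker queue: just grab and go
            while ( defined( my $item = $q->dequeue ) ) {
                process_item($item);       # hypothetical per-file work
            }
        } );
    } 1 .. $WORKERS;

    $q->enqueue($_) for @work_items;       # producer keeps the queue stuffed
    $q->enqueue(undef) for 1 .. $WORKERS;  # one end-of-work marker per worker
    $_->join for @pool;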

Your 7-step overview sounds reasonable, except for the multiple queues and the work-item counting. Fix that and it should improve your throughput somewhat.

But overall, I think this is not a particularly good task for multi-tasking -- whether by multi-threading, multiprocessing, user-space threads (coroutines) or any other mechanism.

Ultimately, the whole thing is IO-constrained -- including the digesting of the files; you can only digest the blocks as fast as you can read them from disk and the time taken to read a (chunk of a) file from disk will always be many times greater than the time required to calculate the digest.

And if your multi-threading scheme means that you end up having different threads reading different files from the same physical spindle concurrently, you will actually slow things down because the disk head will be seeking back and forth between different files (thus disk tracks) trying to keep all your threads supplied with data.

And if you try to avoid that by using a semaphore to allow only one thread at a time to read from disk, you've effectively serialised the entire process and will see little or no gain from the threading.

I'm afraid there is no simple way to achieve substantial throughput gains from multi-tasking this type of IO-bound processing unless the filesystem being processed is itself distributed. That is to say: unless the filesystem is located on a big box of simple disks (eg. a RAIDed NAS, SAN or similar), you'll find it very hard to achieve much better throughput than a well written, single-tasking, single pass through the filesystem gathering the stat info; followed by a second, single-tasking pass digesting just those files that are still potential duplicates after eliminating the obviously different ones.
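For comparison, that two-pass, single-tasking shape is only a few lines (a sketch, assuming $root_dir is the tree to scan; files with a unique size cannot be duplicates, so only the size-collisions ever get digested):

    use File::Find;
    use Digest::MD5;

    # pass 1: one cheap stat per file, grouping by size
    my %by_size;
    find( sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f }, $root_dir );

    # pass 2: digest only the files whose sizes collide
    my %by_digest;
    for my $group ( grep { @$_ > 1 } values %by_size ) {
        for my $file (@$group) {
            open my $fh, '<:raw', $file or next;
            push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
        }
    }
    my @dupe_sets = grep { @$_ > 1 } values %by_digest;   # the true duplicates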


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: Threaded Code Not Faster Than Non-Threaded -- Why?
by Tommy (Chaplain) on Jan 05, 2014 at 18:32 UTC
    Hm. Unless there is some new or modified documentation kicking around that I've not seen, I think you have misinterpreted the documentation. I am not aware of any docs that suggest the queue architecture you are using.

    As mentioned in an above reply...

    ...I provided the wrong link. The code I wrote comes from pool_reuse.pl, taken directly from the examples directory of the threads CPAN distro by JDHEDDEN.

    So I can't take full credit for it. I wouldn't have thought to use a queue for each worker; it didn't make sense to me. I figured it was a safe assumption that the guy who wrote threads knew what he was doing, so I took a leap and used the same (almost line-for-line) code as the core of my threading setup. After all, his documentation said (I summarize) that if you don't have a reusable thread pool then you are probably going to leak RAM badly, and this code is how you solve that problem.

    Now all kinds of holes have been shot through it, and this is a good thing for me. I'll never get anywhere in speeding up the code by paying no heed to the problems with it. I'm quite delighted to see how I've erred, so that I can improve it.

    Although this may sound like polling, it is quite different, in that there is no single event or synchronisation point involved. That is, it is not waiting for the queue length to fall to some exact number, but rather just until it falls below some number. Provided that number is sufficient to ensure that no worker is ever blocked waiting for a new item, this ensures that the workers run at their full potential speed, without risking memory overruns from overstuffing the queue.

    Very interesting. It seems like it will take a bit of experimentation to figure out how long to sleep and how deep to keep the queue.

    That does away with the separate, per-worker item queues and the requests (tid) queue.

    ...And that seems like a lot of overhead, more so every time I read over this. I'm eager to see the effects of (re)moving things around in the code to accomplish this.

    Ultimately, the whole thing is IO-constrained...

    Yes. Precisely. The number crunching involved in calculating digests is what I really wanted to spread across cores. The IO isn't going to go faster because of the threading, but the digest processing can. If I can eliminate the overhead I've introduced into my own code through inefficient thread management, I'm sure I'll see noticeable improvements. By altering my approach even more, I might be able to cut down on the amount of digesting ever performed. That makes the threading even less valuable in the exercise, but not in the lesson learned.

    Thank you BrowserUk

    Tommy
    A mistake can be valuable or costly, depending on how faithfully you pursue correction
      It seems like it will take a bit of experimentation to figure out how long to sleep and how deep to keep the queue.

      These are not as critical as you might first think.

      1. sleep: A thread that calls $Q->pending once every millisecond will consume so few cycles that it barely registers on taskmanager/top.

        Try this:

        perl -MThread::Queue -E"$Q=new Thread::Queue; $Q->pending while Win32::Sleep( 1 )"

        On my system, that shows up in Process Manager as 0.0%.

        So, running it at a frequency of once every timeslice -- ~10 milliseconds -- will consume negligible cpu whilst ensuring that the queue is (re)populated in a timely manner.

      2. Queue Length: A queue, shared between 16 threads, populated with a million longish paths, will consume less than 200MB.

        Try:

        perl -Mthreads -MThread::Queue -E"async{ sleep }->detach for 1 .. 16; $Q=new Thread::Queue; $Q->enqueue( qq[/a/longish/path/and/filename$_] ) for 1 .. 1e6; sleep;"

        Which, given a typical modern system with 4-16 GB of RAM, is nothing to concern ourselves with; and it is at least 3 orders of magnitude greater than I'd recommend as a starting point (100 for 16 threads).

      For an IO-bound process like this, tuning either or both variables over quite a large range of values will have little or no effect on overall throughput.



        The results are in. I forked and refactored the code to follow the pattern you laid out (BrowserUk). The threaded part of the code is over 2X faster now! Thanks again!

        Tommy
