Threaded Code Not Faster Than Non-Threaded -- Why?

by Tommy (Chaplain)
on Jan 05, 2014 at 02:08 UTC

Tommy has asked for the wisdom of the Perl Monks concerning the following question:

So I put together the reference code for the dfw.pm.org hackathon that's going on. One version uses threading, and another version does not. The purpose of the code is to detect (and optionally delete) duplicate files in a given directory tree. You can watch the code in action in a small video here.

You'd be right to ask why I'd want to make this multi-threaded when it has to do with IO. The answer is quite simply that after you've eliminated obvious non-dupes, you have to start comparing the files with a real means of differentiation, namely calculating file digests -- a CPU-intensive task that comes after you're done with the IO. I use xxhash for this, and I was hoping that the threading would let me do the heavy lifting in that area concurrently, spreading the number crunching across 8 CPU cores. Surely that would make it faster, right?
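For concreteness, here is a minimal sketch of the per-file digest step. It substitutes core Digest::MD5 for xxhash (the exact xxhash binding isn't shown here), and digest_file is just an illustrative name:

    use Digest::MD5;

    # hypothetical helper: return a hex digest of one file's contents;
    # the real code uses xxhash instead of MD5 for speed
    sub digest_file {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }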

The code is on github here. The code isn't glorious, and in a few ways I'm unhappy with it, but time constraints keep me from moving what should be modular into a module and refactoring the code to call out to that. If I can't sleep tonight maybe I'll do just that, but...

As expected, the threaded version uses more RAM; what is unexpected is that it gains barely any speed advantage over the non-threaded code. In short executions it's actually slower (I guess that's the thread management overhead). In very long executions I'd expect to see more of a boost, but the boost just isn't there. I thought concurrency would buy me more speed, but that just isn't the case. I'm asking folks who know more about threading to tell me what I'm doing wrong.

I've considered the possibility that something is up with the underlying disk storage -- IO blocking, for example. Well, I can confirm via sar/iostat that both versions of the code push the raid10 array to maximum expected performance levels. I'd like to believe that that's all there is to it, but I get the same lack of "boost" when I run the code on a ramdisk, where seek time and IO blocking become irrelevant. And yet, still no performance gains with threads.

The code is too big to put into a post without breaking all manner of web etiquette laws, but it isn't hard to grok if you open it in vim and fold on the subs. You can see it all laid out. There's the sub that creates the thread pool, a "worker" sub, and a sub to wait on and end the threads in the pool.

So while I start nytprof'ing the code, could you take a peek and let me know if it looks like I'm making any _obvious_ mistakes? Any insight is appreciated, and I thank you in advance for your suggestions.

Tommy
A mistake can be valuable or costly, depending on how faithfully you pursue correction

Re: Threaded Code Not Faster Than Non-Threaded -- Why?
by BrowserUk (Patriarch) on Jan 05, 2014 at 03:03 UTC
    ... take a peek and let me know if it looks like I'm making any _obvious_ mistakes?

    Without having read the rest of the code, the problem is identified right here:

    sub worker {
        my $work_queue = shift;
        my $tid = threads->tid;

        local $/;

        WORKER: while ( !$thread_term ) {
            # signal to the thread poolq that we are ready to work
            $pool_queue->enqueue( $tid );

            # wait for some filename to be put into my work queue
            my $file = $work_queue->dequeue;
            ...

    Imagine a post office where there is bunch of counters and a locked door with a guard.

    When a counter clerk is free, they shout to the guard at the door; he unlocks the door, lets one person in, tells them which counter to go to, and locks the door behind them. Of course, quite often two or more of the counters become free simultaneously, and the respective clerks all yell at the guard for another customer; but he can only service one at a time, so the others have to wait. And the guard tends to service the clerk that shouts loudest, which means old Miss Primenproper at the far end, with her small voice and proprietary nature, often finds herself with nothing to do for long periods....

    The point is, you've created both a bottleneck -- the guard/door (poolQ/workQ) -- and designed in a four-way, inter-thread, handshake (synchronisation; lock-step). That is, when a thread finishes processing one file -- which perhaps took 1/10th of its designated timeslice -- in order to proceed it has to:

    1. Post a "ready to work" message on the poolQ and then relinquish the rest of its timeslice by going into a wait state until the guard sees that message and replies.
    2. At least two thread switches are required: one for the guard to see the message and respond; another for the waiting thread to receive the work item.
    3. And if the guard (poolQ) is busy, it may take many thread-swaps before he sees this worker's request; and many more before the clerk (worker thread) gets a timeslice and receives the message back.

    And in that design, you've reduced your queues to single-element-at-a-time buffers, which defeats their purpose.

    The solution is to desynchronize your threads:

    • The guard (feeder thread) only concerns itself with ensuring that the 'internal queue' (workQ) doesn't get overly full. It has some threshold -- say N where N is the number of worker threads -- and when the workQ falls below that number it allows another N people in to join the internal queue (workQ).
    • The clerks (work threads) all get new customers (filenames) from that same single internal queue (workQ), which means that if they are capable of processing 2 (or N dozen files) in a single timeslice, they do not have to enter wait-states to do so.

    Hopefully that analogy will allow you to see where/what to change in your code. If not, ask questions.
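    A minimal sketch of that single-queue shape, using threads and Thread::Queue (the worker count, the process_file body, and the use of @ARGV as the work list are illustrative, not taken from your actual code):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        # one shared queue; every worker pulls from it directly, so there
        # is no per-thread queue and no request/reply handshake to a guard
        my $work_q  = Thread::Queue->new;
        my @workers = map { threads->create( \&worker ) } 1 .. 8;

        sub worker {
            while ( defined( my $file = $work_q->dequeue ) ) {
                process_file( $file );    # placeholder for the digest work
            }
        }

        sub process_file { print "worker ", threads->tid, ": $_[0]\n" }

        $work_q->enqueue( @ARGV );    # feeder side: just stuff the queue
        $work_q->end;                 # blocked dequeues now return undef
        $_->join for @workers;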

    Update: Note: I'm quite uneasy about this assertion:

    You'd be right to ask why I'd want to make this multi-threaded when it has to do with IO. The answer is quite simply that after you've eliminated obvious non-dupes, you have to start comparing the files with a real means of differentiation, namely calculating file digests--a cpu-intensive task that comes after you're done with the IO.

    This implies that you are reading the entire file and digesting it for every file -- regardless of whether there is another file already seen that has the same date/time/first/last/middle bytes.

    Again, without having digested your entire program in detail, it seems likely that your architecture is letting you down here.

    My initial thoughts for an architecture are:

    One thread scans the filesystem -- there is nothing to be gained from multiple threads thrashing the directories unless you (a) have multiple physical spindles, and (b) can confine each scanner to a single spindle. (This is feasible on NTFS systems but not so easy on unified filesystems, I think.)

    Only if two files with the same size/date/checkbytes are found do those two files get digested by worker threads. If one of them has already been digested, only the other need be done.

    Anyway, that's an ill-thought-through notion, but it might trigger your synapses.
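    To make that digest-at-most-once point concrete, a tiny sketch (%digest_of, digest_once and digest_file are invented names; in a threaded program the cache would need to be :shared and locked):

        my %digest_of;    # file path => digest, computed at most once

        sub digest_once {
            my ($file) = @_;
            $digest_of{$file} //= digest_file($file);    # hypothetical helper
            return $digest_of{$file};
        }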


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      BrowserUk, I am so glad you answered and took the time to look over my code. Without sounding sycophantic, I've long admired your posts, specifically about threads. They've helped me understand the concepts and avoid common pitfalls. So, hmmmm...

      The solution is to desynchronize your threads:

      • The guard (feeder thread) only concerns itself with ensuring that the 'internal queue' (workQ) doesn't get overly full. It has some threshold -- say N where N is the number of worker threads -- and when the workQ falls below that number it allows another N people in to join the internal queue (workQ).
      • The clerks (work threads) all get new customers (filenames) from that same single internal queue (workQ), which means that if they are capable of processing 2 (or N dozen files) in a single timeslice, they do not have to enter wait-states to do so.

      ...It's gonna take me a few minutes to wrap my head around how this would be implemented in code. I'm not sure how to do it and avoid the memory issues mentioned in the documentation for threads. Incidentally, the basis of my code was straight from the documentation for threads and threads::shared. Who knew I would be so off base?

      Are you saying that the "guard" should start polling a single queue and stuffing things into it on demand? How long and how often would I have to usleep to avoid an underrun on one hand and excessive interrupts on the other? That in and of itself seems like it could vary wildly from one environment/server/workstation to another. I'm not sure I understand how to go about it correctly.

      I actually did consider the fact that the thread management/locking/queueing one at a time was killing performance. This is why I started having the "guard" stuff $opts->{qsize} items into the workers' queues at a time (default: 30). I saw a noticeable improvement.

      What to do... Could you steer me toward an implementation like the one you suggest? Google seems more interested in "Coro vs Threads" wars and other silliness that doesn't help me.

      UPDATE Re digesting:

      This implies that you are reading the entire file and digesting it for every file -- regardless of whether there is another file already seen that has the same date/time/first/last/middle bytes.

      The code:

      1. Traverses the filesystem
      2. Groups same-size files, tossing out the rest. This is not threaded
      3. Takes each group and reads the first few bytes of each file, creating sub-groups based on the bytes read. It then removes sub-groups with a single element, thereby "throwing out" the non-similar files from the parent group (sketched below)
      4. Makes a second pass at the above, but at the end of the file (the efficiency of this second pass is debatable but shows good results)
      5. Tallies the final number N of files to be processed in a :shared variable
      6. Creates a thread pool with worker threads, shoves 30 files at a time into their queues, and waits until the threads have incremented their processed-file count to equal N
        • The threads digest the files in their queues in their entirety (this is bad?)
      7. Main thread signals to the threads that they are done by ending their queues and finally joins them
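      A rough single-threaded sketch of steps 2 and 3 (names, the @files list, and the 1 KiB read size are invented for illustration; the actual code differs):

          # step 2: group by size; a unique size cannot be a duplicate
          my %by_size;
          push @{ $by_size{ -s $_ } }, $_ for @files;    # @files assumed
          delete $by_size{$_} for grep { @{ $by_size{$_} } < 2 } keys %by_size;

          # step 3: sub-group by leading bytes; again drop the singletons
          for my $size ( keys %by_size ) {
              my %by_head;
              for my $file ( @{ $by_size{$size} } ) {
                  open my $fh, '<:raw', $file or next;
                  read $fh, my $head, 1024;
                  push @{ $by_head{$head} }, $file;
              }
              # only sub-groups with 2+ members remain digest candidates
          }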
      Tommy
      A mistake can be valuable or costly, depending on how faithfully you pursue correction
        Incidentally, the basis of my code was straight from the documentation for threads and threads::shared. Who knew I would be so off base?

        Hm. Unless there is some new or modified documentation kicking around that I've not seen, I think you have misinterpreted the documentation. I am not aware of any docs that suggest the queue architecture you are using.

        Namely: each worker has its own work-item queue plus a shared work-request queue. In order to get a work item, a worker has to enqueue a request (its thread id) onto the request queue; wait (many time-expensive context switches) for the thread at the other end of that queue to dequeue that tid and then enqueue a work item; and then continue waiting until the initiating worker gets another timeslice (many more context switches) before it can actually do some work.

        The problems with this -- as I tried to analogise above -- are: a) you have many worker threads all waiting on one non-worker queue; b) one non-worker queue having to satisfy the demands of many worker threads.

        Are you saying that the "guard" should start polling a single queue and stuffing things into it on demand?

        No. One of the purposes of queues is to avoid polling. (IMO there should not be a dequeue_nb() method for a queue; but that's a different discussion :)

        The non-worker thread's job is to keep enough work in the (single) work-item queue to ensure that whenever a worker needs a work item, there is one immediately available. It doesn't have to ask for one; it just grabs one.

        Its only other responsibility is to ensure that it doesn't overfill the queue. The 'damage' overfilling does is to use more memory than necessary. Basically, whenever it (the non-worker, producer thread) has a new work item available, it checks the size of the work-item queue, and if it is bigger than some threshold value (2 or 3 times the number of workers is a good starting point), it simply sleeps for a short period to allow the worker threads to diminish the queue a little further.

        Although this may sound like polling, it is quite different in that there is no single event or synchronisation involved. That is, it is not waiting for the queue length to fall to some exact number, but rather just until it falls below some number. Provided that number is sufficient to ensure that no worker is ever blocked waiting for a new item, this ensures that the workers run at their full potential speed without risking memory overruns by overstuffing the queue.

        What to do... Could you steer me toward an implementation like the one you suggest?

        The first thing is that all your workers should be reading from a single queue. After all they are all identical, so it doesn't matter which worker processes which work item.

        That does away with the separate per-worker item queues and the request (tid) queue. The non-worker just stuffs every item onto the common worker queue as it gets them, and only pauses briefly -- a single timeslice period, ~10ms, is a good starting point -- iff the queue is starting to overfill. If you get the length of the queue right, this will be a rare event.
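        A hedged sketch of that feeder loop ($num_workers, @work_items and $work_q are assumed names; the threshold multiple and 10ms pause are the starting points suggested above, not tuned values):

            use Time::HiRes qw(usleep);

            my $threshold = 3 * $num_workers;    # 2-3x workers, per above
            for my $item (@work_items) {
                # no exact-count synchronisation: just back off briefly
                # whenever the queue is over the soft limit
                usleep(10_000) while $work_q->pending >= $threshold;
                $work_q->enqueue($item);
            }
            $work_q->end;    # blocked workers drain out with undef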

        This way each thread can -- assuming the nature of the work allows it -- process as many work items as it can in each timeslice it gets. Thus you maximise timeslice utilisation and minimise the expensive context switches. This architecture also ensures that when some workers run more slowly -- processing bigger files -- the other workers can continue to run flat out if they get a bunch of small ones. I.e. the single-queue/multiple-workers scheme becomes self-load-balancing.

        Your 7 step overview sounds reasonable except for the multiple queues and work item counting. Fix that and it should improve your throughput somewhat.

        But overall, I think this is not a particularly good task for multi-tasking -- whether by multi-threading, multiprocessing, user-space threads (coroutines) or any other mechanism.

        Ultimately, the whole thing is IO-constrained -- including the digesting of the files; you can only digest the blocks as fast as you can read them from disk, and the time taken to read a (chunk of a) file from disk will always be many times greater than the time required to calculate the digest.

        And if your multi-threading scheme means that you end up having different threads reading different files from the same physical spindle concurrently, you will actually slow things down because the disk head will be seeking back and forth between different files (thus disk tracks) trying to keep all your threads supplied with data.

        And if you try to avoid that by using a semaphore to allow only one thread at a time to read from disk, you've effectively serialised the entire process and will see little or no gain from the threading.

        I'm afraid there is simply no simple way to achieve substantial throughput gains from multi-tasking this type of IO-bound processing unless the filesystem being processed is itself distributed. That is to say: unless the filesystem is located on a big box of simple disks (e.g. a RAIDed NAS, SAN or similar), you'll find it very hard to achieve much better throughput than a well-written, single-tasking, single pass through the filesystem gathering the stat info, followed by a second, single-tasking pass digesting just those files that are still potential duplicates after eliminating the obviously different ones.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Threaded Code Not Faster Than Non-Threaded -- Why? (meat)
by Anonymous Monk on Jan 05, 2014 at 05:11 UTC

    I don't understand this part. I think you have too many queues; you should probably have two: one for files to process, and one for the results of that processing.

    Also, the closures don't make sense to me.

    The meat of the threading code :)

    sub create_thread_pool {
        my $files_to_digest = shift;

        threads->create( threads_progress => $files_to_digest );

        for ( 1 .. $opts->{threads} ) {
            my $thread_queue  = Thread::Queue->new;
            my $worker_thread = threads->create( worker => $thread_queue );
            $worker_queues->{ $worker_thread->tid } = $thread_queue;
        }

        lock $threads_init;
        $threads_init++;
    }

    sub get_dup_digests {
        my $size_dups = shift;
        my $dup_count = 0;
        my $queued    = 0;

        $dup_count += @$_ for map { $size_dups->{ $_ } } keys %$size_dups;

        # creates thread pool, passing in as an argument the number of files
        # that the pool needs to digest. this is NOT equivalent to the number
        # of threads to be created; that is determined in the options ($opts)
        create_thread_pool( $dup_count );

        sub get_tid {
            my $tid = $pool_queue->dequeue;
            return $tid;
        }

        my $tid = get_tid();

        SIZESCAN: for my $size ( keys %$size_dups ) {
            my $group = $size_dups->{ $size };

            for my $file ( @$group ) {
                $worker_queues->{ $tid }->enqueue( $file ) if !$thread_term;
                $queued++;
                $tid = get_tid() and $queued = 0 if $queued == $opts->{qsize} - 1;
                last SIZESCAN unless defined $tid;
            }
        }

        # wait for threads to finish
        while ( $d_counter < $dup_count ) {
            usleep 1000;    # sleep for 1 millisecond
        }

        # ...tell the threads to exit
        end_wait_thread_pool();

        # get rid of non-dupes
        delete $digests->{ $_ }
            for grep { @{ $digests->{ $_ } } == 1 } keys %$digests;

        my $priv_digests = {};

        # sort dup groupings
        for my $digest ( keys %$digests ) {
            my @group = @{ $digests->{ $digest } };
            $priv_digests->{ $digest } = [ sort { $a cmp $b } @group ];
        }

        undef $digests;

        return $priv_digests;
    }
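    For contrast, a minimal version of the two-queue shape suggested above -- one queue of files in, one queue of results out (digest_file is a placeholder name):

        my $work_q   = Thread::Queue->new;    # filenames in
        my $result_q = Thread::Queue->new;    # [ file, digest ] pairs out

        sub worker {
            while ( defined( my $file = $work_q->dequeue ) ) {
                $result_q->enqueue( [ $file, digest_file($file) ] );
            }
        }

        # the main thread then collects [ file, digest ] pairs from
        # $result_q instead of sharing a counter with the workers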
      I don't understand this part. I think you have too many queues; you should probably have two: one for files to process, and one for the results of that processing.

      It comes straight from here, which came straight from here. If my implementation of that core documentation code is flawed, I really want to see an implementation that isn't. I'm totally serious. I want to learn how to do it right.

      Tommy
      A mistake can be valuable or costly, depending on how faithfully you pursue correction

        I think you're misreading it - it includes examples of creating queues, but I don't see it implying that you need multiple queues.

        I have an example of a 'queue based' worker thread model: A basic 'worker' threading model

        Personally I'd be thinking in terms of using File::Find to traverse your filesystem linearly, but have it feed a queue with files that need more detailed inspection. The two most expensive operations in this process are filesystem traversal (which is hard to optimise without messing with disks and filesystem layout) and reading the files to calculate their hashes -- the reading may well be more 'expensive' than doing the sums. My thought would be to ask if you can do partial hashes, iteratively: if you work through a file say one block at a time (size varies depending on filesystem), you have a single read IO operation that you then hash, and you can work further through the file if it's longer, until the hashes don't match. If the file is a genuine dupe you'll still have to read the whole lot, but if it's not, it will be discarded faster.
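        A sketch of that iterative idea, comparing two same-size candidates block by block and bailing at the first difference (the 64 KiB block size is illustrative, and a real version might hash each block rather than compare raw bytes):

            # returns true only if both files have identical contents;
            # a non-dupe is rejected as soon as one block differs
            sub same_contents {
                my ( $file1, $file2 ) = @_;
                my $blk = 65_536;
                open my $fh1, '<:raw', $file1 or return 0;
                open my $fh2, '<:raw', $file2 or return 0;
                while (1) {
                    my $r1 = read $fh1, my $buf1, $blk;
                    my $r2 = read $fh2, my $buf2, $blk;
                    return 0 unless defined $r1 and defined $r2;   # read error
                    return 0 if $r1 != $r2 or $buf1 ne $buf2;      # mismatch
                    return 1 if $r1 == 0;                          # both at EOF
                }
            }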

Re: Threaded Code Not Faster Than Non-Threaded -- Why?
by sundialsvc4 (Abbot) on Jan 06, 2014 at 03:32 UTC

    Also bear in mind that since this appears to be “calculating file digests,” it is necessarily an I/O-intensive task, not a CPU-intensive one: the content of the file must be retrieved from disk in order to be digested. The ruling constraint will be the capacity of the disk drive to seek, that is, to move the read/write head from one track to another. There will be some opportunity for threading here, but perhaps not nearly so much as you might initially have supposed.

Re: Threaded Code Not Faster Than Non-Threaded -- Why?
by kschwab (Vicar) on Jan 06, 2014 at 22:22 UTC
    Not related to your specific question, but perhaps related to the hackathon. I'm not sure if the hackathon rules would allow it, but there's an old LD_PRELOAD hack for ext filesystems that speeds up readdir() calls by doing them in inode order (which is somewhat correlated to disk layout). It's here: spd_readdir.c Thought it was worth mentioning since speeding up I/O would probably have a more meaningful effect.
