Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

So I put together the reference code for the hackathon that's going on. One version uses threading, and another version does not. The purpose of the code is to detect (and optionally delete) duplicate files in a given directory tree. You can watch the code in action in a small video here.

You'd be right to ask why I'd want to make this multi-threaded when it has to do with IO. The answer is quite simply that after you've eliminated obvious non-dupes, you have to start comparing the files with a real means of differentiation, namely calculating file digests--a cpu-intensive task that comes after you're done with the IO. I use xxhash for this and I was hoping that the threading would help me concurrently do the heavy lifting in that area, spreading the number crunching across 8 cpu cores. Surely that would make it faster, right?

The code is on github here. The code isn't glorious, and in a few ways I'm unhappy with it but time constraints keep me from moving what should be modular over into a module and refactoring the code to call out to that. If I can't sleep tonight maybe I'll do just that, but...

As expected, the threaded version uses more RAM, however what is unexpected is that it gains barely any advantage over the non-threaded code in speed. In short executions, it's actually slower (I guess that's the thread management overhead). In very long executions of the code, I'd expect to see more of a boost, but the boost just isn't there. By using concurrency I thought I'd gain more of an advantage in speed, but this just isn't the case. I'm asking folks who know more about threading if they could tell me what I'm doing wrong.

I've considered the possibility that something must be up with the underlying disk storage. IO blocking for example. Well I can confirm via sar/iostat that both versions of the code push the raid10 array to maximum expected performance levels. I'd like to believe that that's all there is to it, but I get the same lack of "boost" when I run the code on a ramdisk. Seek time and IO blocking become irrelevant in such a scenario. And yet, still no performance gains with threads.

The code is too big to put into a post without breaking all manner of web etiquette laws, but it isn't hard to grok if you open it in vim and fold on the subs. You can see it all laid out. There's the sub that creates the thread pool, a "worker" sub, and a sub to wait on and end the threads in the pool.

So while I start nytprof'ing the code, could you take a peek and let me know if it looks like I'm making any _obvious_ mistakes? Any insight is appreciated, and I thank you in advance for your suggestions.

A mistake can be valuable or costly, depending on how faithfully you pursue correction

In reply to Threaded Code Not Faster Than Non-Threaded -- Why? by Tommy

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2023-03-21 18:35 GMT
Find Nodes?
    Voting Booth?
    Which type of climate do you prefer to live in?

    Results (60 votes). Check out past polls.