In addition to trying to figure out why the CPU utilization is not staying constant, I would like to find a way to dynamically add threads based on CPU utilization such that more work can be done when the CPU load falls below a certain threshold (arbitrarily, let's say 25%).

Forget the idea of throwing more threads at the problem in order to increase your cpu utilisation.

trying to figure out why the CPU utilization is not staying constant ...

In the first half of your program, less than 100% utilisation is to be expected as a) you are doing disk IO, reading the XML files from disk; b) writing to your shared %similar hash which will involve locking -- whether you are doing it; which you do not show -- or just using the implicit locking that perl does internally for its own protection.

In the second half of your program, you are a) iterating over that same shared hash, which means that internal locking comes into play. (But that's probably not a big problem for that part of your program).

Much worse is that each of your second pool of threads is performing the same bi-gramming process on each of the keys of the shared hash, for each work item it receives. AND EACH OF THOSE WORK ITEMS IS ITSELF, ONE OF THE KEYS OF THAT HASH!

That means you are doing a O(n^2) process of comparing each key against each key, and bi-graming both keys again every time. That is monumentally wasteful.

If you fix that algorithm -- ie. bi-gram each of the keys in the hash once first -- and then compare them each against the next, a single thread will probably finish much faster than your current mechanism. At which point, all the talk of dynamically spawning threads according to cpu load becomes moot.

Once you've you have made your similarity algorithm implementation efficient -- single-threaded -- it might then be worth considering whether if can be sped up using threading. But for now, ditch your second thread pool, use a single threaded loop to construct bi-gram arrays for each of your keys; and then use a second loop to compare one against the other.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


In reply to Re: dynamic number of threads based on CPU utilization by BrowserUk
in thread dynamic number of threads based on CPU utilization by mabossert

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.