in reply to Re^9: Strange memory leak using just threads (forks.pm)
in thread Strange memory leak using just threads

About 10 seconds to edit the test; much less to run it (< 0.01s per core):

Updated: Improved the tests; de-obfuscated the code.

perl -Mthreads="stack_size,4096" -Mthreads::shared -MTime::HiRes=time -wE" $N = 64; $t = time; my $c :shared = 0; async( sub{ ++$c; sleep 10 } )->detach for 1 .. $N; 1 while $c < $N; say time - $t; sleep 10 "
0.578999996185303

perl -Mthreads="stack_size,4096" -Mthreads::shared -MTime::HiRes=time -wE" $N = 256; $t = time; my $c :shared = 0; async( sub{ ++$c; sleep 10 } )->detach for 1 .. $N; 1 while $c < $N; say time - $t; sleep 10 "
2.57699990272522

Editing the number is all that would be required.

Now. How long will it take to re-tune your POE-based behemoth of an application when moving from running on 4 cores to 256 cores?

From my experience of tuning event-driven systems for different hardware: weeks.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^11: Strange memory leak using just threads (forks.pm)
by zwon (Abbot) on Sep 22, 2010 at 08:15 UTC

    I was talking about money actually - a 256-core system will be very expensive. But looking at your example... from my experience it takes quite a lot of time to find the problem in a threaded application when somebody forgets to lock a shared variable before changing its value.
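
    For illustration, a minimal sketch of that failure mode: ++ on a shared variable is a read-modify-write, not an atomic operation, so concurrent increments without lock() silently lose updates.

    #!/usr/bin/perl
    # Lost-update race: four threads increment a shared counter;
    # without lock() the final count usually falls short of 400000.
    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $count : shared = 0;

    my @t = map {
        threads->create( sub {
            for ( 1 .. 100_000 ) {
                # lock $count;    # uncomment to make the increment safe
                $count++;
            }
        } );
    } 1 .. 4;
    $_->join for @t;

    print "$count\n";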

    Upd: And the tests are not correct, BTW. The time required to create a new thread depends on the size of all existing variables.

      I was talking about money actually - 256 cores system will be very expensive.

      I'm not sure about the relevance of that?

      When it is affordable--and it's already available; the IBM Power 795 can have 256 cores with 4 hardware threads per core, giving a 1024-thread processor in a box--a threaded solution will port to it with the change of one number.

      Two years or so ago, 4 cores were horribly expensive. I now have one sitting in front of me that cost about the same as a high-end smartphone!

      Next year will see the release of 16-core commodity processors, some with 2-way hyperthreading. When my next "refresh" cycle comes around in 2012, I'll be looking for 16 cores at a dirt-cheap price. I'll probably have to wait until towards the end of the year, rather than the beginning. I'll also be looking to have a 512/1024-core GPU in the same box for the same price.

      I know POE can span clusters, but clusters don't make sense, other than as a stop-gap solution until a multi-core box that fits your aspirations or price becomes available.

      As to why, there is a wonderful example of the problem in an article I read just today. The fourth para (starting "For example") is the crux of the cluster problem (albeit the example there involves GPUs).

      Just as scaling with processes tops out very quickly because of the costs of IPC; so clustering tops out even more quickly because of the even higher costs of inter-box network comms (INC). And as processors become more efficient at processing a given volume of data, so the ratio of non-productive IPC and INC to useful CPU work grows.

      And the tests are not correct BTW. Time required for creating new thread depends on the size of all existing variables.

      Yes. That is an annoying detail of the ithreads implementation. But, it is quite easy to avoid; you just spawn your workers early, and have them require rather than use what they (individually) need.
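
      A minimal sketch of that pattern (Some::Heavy::Module is a hypothetical stand-in for whatever a worker individually needs):

      #!/usr/bin/perl
      # Spawn the workers while the interpreter is still small; each one
      # then require()s, at runtime, only what it individually needs.
      # A use() at the top of the script would load the module before
      # the spawn, and it would be cloned into every thread.
      use strict;
      use warnings;
      use threads;

      sub worker {
          my ($id) = @_;
          require Some::Heavy::Module;    # hypothetical module
          Some::Heavy::Module->import;
          # ... do the actual work ...
      }

      # Workers first...
      my @workers = map { threads->create( \&worker, $_ ) } 1 .. 4;

      # ...then the main thread loads whatever it alone needs.
      require POSIX;

      $_->join for @workers;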

      But forks face a similar problem--indeed, the copy problem exists in iThreads specifically because of the attempt to make threading look like fork. If you need to share data between forks, there is still a "duplication penalty", although it is disguised by coming at use-time, rather than spawn-time.

      COW may appear to avoid the need to duplicate data memory, but it just means it gets duplicated piecemeal on use, rather than in one lump up front. Even if it is "read-only" in the sense of your program, it is often modified by simple "read-only" accesses.

      Use $#array and a 4k page of COW'd memory gets copied. Use a regex on a string that alters its pos, and another 4k page gets copied. Even if the string is only 4 characters. Use a single scalar, instantiated as a number, in a string context, and another 4k page gets copied. Use each, keys, or values on a hash, and (at least) another 4k gets copied.
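
      That this happens can be seen with the core Devel::Peek module; a small sketch showing a string-context read adding a PV buffer to a numeric scalar, and a //g match attaching pos() magic:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Devel::Peek;

      my $n = 42;
      Dump( $n );      # an integer: no POK flag, no PV string buffer

      my $s = "$n";    # merely *reading* $n in string context...
      Dump( $n );      # ...and now POK is set and a PV buffer was written

      my $str = 'abcd';
      $str =~ /b/g;    # a successful //g match...
      Dump( $str );    # ...attaches pos() magic, modifying $str in place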

      These piecemeal requests for another single page of VM followed by a 4k copy through ring 0 add up to far more than an up-front single request for as many pages of VM as are required followed by a single ring 3 rep mov.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Use $#array and a 4k page of COW'd memory gets copied...and (at least) another 4k gets copied

        This is only true if all these variables reside in different memory pages; otherwise just one page is copied (and afterwards, the copy is modified). As such, this sequence of events is misleading.

        These piecemeal requests for another single page of VM followed by a 4k copy through ring 0 add up to far more than an up-front single request for as many pages of VM as are required followed by a single ring 3 rep mov.

        Umm, when put as the general case, no. Moving stuff around in RAM is expensive. The CPU/MMU operations required to throw a page fault and allocate memory are insignificant compared to the time it takes to actually copy the data in memory. Which means that there is only a very slight difference (over the lifetime of a process) between copying the whole process address space in one move and doing so piecemeal as pages are dirtied.

        Also, for most use cases, the process will not have to read/write access its entire memory throughout its lifetime. For the far more common usage scenario, when part of the memory remains untouched and therefore shared, copy-on-write is far more efficient than copy-all-at-once (even only talking about speed, this is obviously even more true for process memory consumption).
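
        That sharing can be made visible with a rough sketch (Linux-only: it assumes /proc/$$/smaps_rollup, available since kernel 4.14): pages the forked child never writes remain shared with the parent, and only writing them forces the COW copies.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @big = (1) x 1_000_000;    # a few tens of MB in the parent

        # Private_Dirty: anonymous pages this process has its own copy of.
        sub private_dirty_kb {
            open my $fh, '<', "/proc/$$/smaps_rollup" or die "smaps_rollup: $!";
            while (<$fh>) { return $1 if /^Private_Dirty:\s+(\d+)\s+kB/ }
            return;
        }

        my $pid = fork() // die "fork: $!";
        if ( $pid == 0 ) {    # child
            printf "before writing: %6d kB private\n", private_dirty_kb();
            $_++ for @big;    # write every element: pages get copied now
            printf "after writing:  %6d kB private\n", private_dirty_kb();
            exit 0;
        }
        waitpid $pid, 0;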


        All dogma is stupid.
        an up-front single request for as many pages of VM as are required followed by a single ring 3 rep mov.

        I don't think so. Each variable has to be copied separately, and you have to fix up the references. So it's much more than a single memory allocation.

        Yes. That is an annoying detail of the ithreads implementation. But, it is quite easy to avoid; you just spawn your workers early, and have them require rather than use what they (individually) need.

        Sorry, I don't see how this can help. Workers usually all need the same set of modules.