in reply to Re^8: Strange memory leak using just threads (forks.pm)
in thread Strange memory leak using just threads

How much will it cost to replace 4 with 64? With 128? With 256?


Replies are listed 'Best First'.
Re^10: Strange memory leak using just threads (forks.pm)
by BrowserUk (Patriarch) on Sep 21, 2010 at 06:06 UTC

    About 10 seconds to edit the test; much less to run it (< 0.01s per core):

    Updated: Improved the tests; de-obfuscated the code.

    perl -Mthreads="stack_size,4096" -Mthreads::shared -MTime::HiRes=time -wE" $N = 64; $t = time; my $c :shared = 0; async( sub{ ++$c; sleep 10 } )->detach for 1 .. $N; 1 while $c < $N; say time - $t; sleep 10 "
    0.578999996185303

    perl -Mthreads="stack_size,4096" -Mthreads::shared -MTime::HiRes=time -wE" $N = 256; $t = time; my $c :shared = 0; async( sub{ ++$c; sleep 10 } )->detach for 1 .. $N; 1 while $c < $N; say time - $t; sleep 10 "
    2.57699990272522

    Editing the number is all that would be required.

    Now: how long will it take to re-tune your POE-based behemoth of an application when moving from running on 4 cores to 256 cores?

    From my experience of tuning event-driven systems for different hardware: weeks.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I was talking about money, actually: a 256-core system will be very expensive. But looking at your example... from my experience, it takes quite a lot of time to find the problem in a threaded application when somebody forgets to lock a shared variable before changing its value.

      Upd: And the tests are not correct, BTW. The time required to create a new thread depends on the size of all existing variables.
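The locking pitfall mentioned above can be sketched in a few lines (the counter and worker names here are illustrative, not from either post). Because ++ on a :shared variable is a non-atomic read-modify-write, concurrent increments can be lost unless each one is wrapped in lock():

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my $count : shared = 0;

sub bump_safely {
    for ( 1 .. 10_000 ) {
        lock $count;    # serialise the read-modify-write;
        ++$count;       # without this lock, increments can be lost
    }                   # the lock is released at the end of the block
}

my @workers = map { threads->create( \&bump_safely ) } 1 .. 4;
$_->join for @workers;

print "$count\n";    # 40000: no increment lost while lock() is held
```

Omit the lock() and the final count becomes nondeterministic under contention, which is exactly the kind of bug that takes "quite a lot of time to find".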

        I was talking about money, actually: a 256-core system will be very expensive.

        I'm not sure about the relevance of that?

        When it is affordable--and it's already available: the IBM Power 795 can have 256 cores (with 4 hardware threads per core, giving a 1024-thread processor in a box)--a threaded solution will port to it with the change of one number.

        Two years or so ago, 4 cores were horribly expensive. I now have one sitting in front of me that cost about the same as a high end smartphone!

        Next year will see the release of 16-core commodity processors, some with 2-way hyper-threading. When my next "refresh" cycle comes around in 2012, I'll be looking for 16 cores at a dirt-cheap price. I'll probably have to wait until towards the end of the year, rather than the beginning. I'll also be looking to have a 512/1024-core GPU in the same box for the same price.

        I know POE can span clusters, but clusters don't make sense other than as a stop-gap solution until a multi-core box fitting your aspirations or budget becomes available.

        For why, there is a wonderful example of the problem cited in an article I read just today. The fourth paragraph (starting "For example") is the crux of the cluster problem (albeit that example concerns GPUs).

        Just as scaling with processes tops out very quickly because of the costs of IPC; so clustering tops out even more quickly because of the even higher costs of inter-box network comms (INC). And as processors become more efficient at processing a given volume of data, so the ratio of non-productive IPC and INC to useful CPU work grows.

        And the tests are not correct, BTW. The time required to create a new thread depends on the size of all existing variables.

        Yes. That is an annoying detail of the ithreads implementation. But it is quite easy to avoid: you just spawn your workers early, and have them require rather than use what they (individually) need.
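A hedged sketch of that spawn-early pattern (the worker body and List::Util are stand-ins for whatever each worker actually needs): spawn the threads while the interpreter is still small, then let each thread require its own dependencies at run time:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;

# Spawn the workers FIRST, while the interpreter is still small: each new
# ithread clones the parent's current data, so the less that exists at
# spawn time, the cheaper every clone.
my @workers = map { threads->create( \&worker, $_ ) } 1 .. 4;

sub worker {
    my ($id) = @_;
    # require (run time) rather than use (compile time, pre-spawn), so the
    # module is loaded only inside the threads that actually need it.
    require List::Util;
    return List::Util::sum( 1 .. $id );
}

# Heavy modules / big data for the parent would be loaded only AFTER the
# spawns above, so they are never cloned into the workers.
my $results = join ',', map { $_->join } @workers;
print "$results\n";    # 1,3,6,10
```

Anything the parent loads after the spawns never gets cloned, which sidesteps the "size of all existing variables" cost noted above.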

        But forks face a similar problem--indeed, the copy problem exists in iThreads specifically because of the attempt to make threading look like fork. If you need to share data between forks, there is still a "duplication penalty", although it is disguised by coming at use-time, rather than spawn-time.

        COW may appear to avoid the need to duplicate data memory, but it just means the data gets duplicated piecemeal on use, rather than in one lump up front. Even if data is "read-only" in the sense of your program, it is often modified internally by apparently read-only accesses.

        Use $#array and a 4k page of COW'd memory gets copied. Use a regex on a string that alters its pos, and another 4k page gets copied--even if the string is only 4 characters long. Use a single scalar, instantiated as a number, in a string context, and another 4k page gets copied. Use each, keys, or values on a hash, and (at least) another 4k gets copied.
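The pos case is easy to observe from pure Perl (this demonstrates the hidden write itself; the page copy it would force under COW is the OS-level consequence). A //g match in scalar context stores a resume position inside the scalar it matched against, so merely searching a string mutates it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = 'abcd';    # a tiny string, but it still lives on a whole VM page

# Before any match, the scalar carries no match-position state.
my $before = defined pos($s) ? pos($s) : 'undef';

# A "read-only" search: we inspect the string and change no characters ...
$s =~ /b/g;

# ... yet the match wrote its resume offset into the scalar's magic --
# exactly the kind of hidden write that dirties a COW'd page.
my $after = pos($s);

print "pos before: $before, after: $after\n";    # pos before: undef, after: 2
```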

        These piecemeal requests for another single page of VM, each followed by a 4k copy through ring 0, add up to far more than a single up-front request for as many pages of VM as are required, followed by a single ring 3 rep movs.

