in reply to Sharing large data structures between threads

Thank you all for your replies and discussion!

I went ahead and implemented the one-thread data management queue. Much to my surprise, however, it didn't result in any perceptible change in memory usage! It looks like Perl does something intelligent when copying shared structures, probably something like Copy-On-Read.

Because of this, I then expected my long-term memory usage to go down. But after 10 hours of run-time that didn't seem to happen either...

I assume my usage pattern for the structure makes it resistant to memory bloat. The structure is a kind of work queue that threads take work items from. They periodically go through the structure checking for valid elements, but this touches only a small part of the data; when the rest of the data is read, it is discarded shortly afterwards.

Now it seems the worst cause of bloat is the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads, this uses half the available 32-bit address space...

Optimizing modules is outside the scope of my question, so I'm not going to dwell on it further. Suffice it to say that I have options there, as I am using LWP::UserAgent, which could be replaced by other modules like LWP::Parallel or HTTP::Async, or even by my own socket code.
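
For what it's worth, the HTTP::Async style would let a single thread juggle many connections. A rough sketch only, not my actual code (@urls and the response handling are placeholders):

    use HTTP::Async;
    use HTTP::Request;

    # One thread, many simultaneous connections.
    my $async = HTTP::Async->new(
        slots   => 20,      # max concurrent requests (illustrative)
        timeout => 300,     # per-request timeout in seconds
    );
    $async->add( HTTP::Request->new( GET => $_ ) ) for @urls;  # @urls: placeholder

    # Responses come back one at a time, as they complete.
    while ( my $response = $async->wait_for_next_response ) {
        # ... handle $response here ...
    }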

Re^2: Sharing large data structures between threads
by BrowserUk (Patriarch) on Mar 08, 2011 at 12:25 UTC
    Now it seems the worst cause of bloat are the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads this uses half the available 32-bit address space...

    5MB per kernel thread is hardly huge for an interpreted language.

    You made it clear in your OP that you don't want to hear this, which is why I haven't responded directly earlier, but your memory problems are a direct result of the bad architecture of your code.

    Simply stated, there is no point in running 200 concurrent threads. You mention LWP, so I infer that your program is doing some sort of downloading. If you have a 100 Mb/s connection, then by running 200 concurrent threads you've effectively given each thread the equivalent of a dial-up connection. So, in addition to chewing up more memory, throwing threads at your download task simply means that each thread takes longer and longer to complete its individual download.

    In addition to your listed alternatives of LWP::Parallel and HTTP::Async, there is another: using a sensible number of threads in a thread pool arrangement. For this type of IO-bound processing, that usually works out to somewhere in the order of 2 to 5 times the number of cores your box has. For today's typical hardware, that means 8 to 20 threads will give you the best throughput.

    Using your own numbers, that will need < 100MB of memory, leaving at least 1.9GB available for your data.
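
    A minimal sketch of the arrangement, using the standard threads and Thread::Queue modules (the pool size and processing are illustrative, and @urls stands in for however you obtain your work items):

        use threads;
        use Thread::Queue;
        use LWP::UserAgent;

        my $POOL_SIZE = 16;     # somewhere in the 2x-5x cores range
        my $Q = Thread::Queue->new;

        my @pool = map {
            threads->create( sub {
                my $ua = LWP::UserAgent->new( timeout => 300 );
                # dequeue blocks while the queue is empty; undef means shut down
                while ( defined( my $url = $Q->dequeue ) ) {
                    my $res = $ua->get( $url );
                    # ... process $res; re-queue $url for a retry if need be ...
                }
            } );
        } 1 .. $POOL_SIZE;

        $Q->enqueue( @urls );                   # your work items
        $Q->enqueue( (undef) x $POOL_SIZE );    # one shutdown marker per worker
        $_->join for @pool;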


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If you have a 100 Mb/s connection, then by running 200 concurrent threads you've effectively given each thread the equivalent of a dial-up connection.

      If all the threads were doing active work (like downloading), I would have no problems. The problems arise when some servers time out on me. I must use long timeouts (at least 5 minutes), and I have to retry the operations without too long a delay. This is what I have tried to counter by running a larger number of threads (100 is the maximum for now).

      That of using a sensible number of threads in a thread pool arrangement.

      But then I cannot use LWP. In fact, I do use a thread pool at the moment; my worker threads can pick up any kind of work (an LWP fetch or a database fetch). The current worker thread code is very straightforward because it blocks on all I/O. To do non-blocking I/O, I would have to process several work items simultaneously, making the code quite messy. But I understand I may have to go that route.
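
      For reference, my current worker loop is essentially this shape (simplified, with made-up field names; $queue, $ua and $dbh are set up elsewhere):

          # $queue is a Thread::Queue shared by all workers
          while ( defined( my $item = $queue->dequeue ) ) {
              if ( $item->{type} eq 'http' ) {
                  my $res = $ua->get( $item->{url} );   # blocks, up to 5 minutes
                  # ... store the result, or re-queue the item on failure ...
              }
              elsif ( $item->{type} eq 'db' ) {
                  my $rows = $dbh->selectall_arrayref( $item->{sql} );  # also blocks
                  # ... process $rows ...
              }
          }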

        I must use long timeouts (at least 5 minutes), and I have to retry the operations with not too much of a delay.

        In this case I would probably use a pool of LWP::Parallel::UserAgent instances, one per core, feeding off a common queue. In this way, you can potentially make full use of both your cores and your bandwidth, whatever combination of instantaneous loads your program finds itself having to deal with.

        I fully appreciate your preference for the simple linear flow of blocking IO architectures, but LWP::Parallel::UserAgent's three callbacks make for a reasonably sane and manageable alternative.
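
        In outline: you subclass it and override the documented on_connect, on_return and on_failure methods. A sketch of the shape only (the retry policy and @urls are yours to fill in):

            package PoolAgent;
            use base 'LWP::Parallel::UserAgent';

            sub on_connect {    # a connection is being opened
                my ( $self, $request, $response, $entry ) = @_;
                print "connect: ", $request->url, "\n";
            }
            sub on_failure {    # the connection attempt failed
                my ( $self, $request, $response, $entry ) = @_;
                warn "failed: ", $request->url, "\n";
                # this is where you would arrange a retry
            }
            sub on_return {     # the request completed (success or error status)
                my ( $self, $request, $response, $entry ) = @_;
                print $request->url, " => ", $response->code, "\n";
                return;
            }

            package main;
            use HTTP::Request;

            my $pua = PoolAgent->new;
            $pua->register( HTTP::Request->new( GET => $_ ) ) for @urls;
            my $entries = $pua->wait( 300 );   # your 5-minute timeout;
                                               # $entries holds the completed requests/responses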

        A couple of speculations from reading between the lines of your posts:

        • It sounds like you're mixing DB fetches and LWP fetches in the same thread routines?

          If so, I'd suggest that you seriously consider separating those two pieces of functionality.

        • It sounds like you have your workers scanning (polling) your large, shared, data-structure looking for work?

          This is almost always a bad idea. (Indeed, large shared data structures are usually a bad idea, full stop. But I digress.)

          In order for there to be a work item in your large shared data structure (LSDS), something has to write it there. And then your workers have to trundle around that LSDS re-reading the same data over and over, trying to recognise what is new, and which of the new items are as-yet-unactioned work items.

          And of course, as you have many workers, you also have to deal with the possibility that two or more workers happen across the same new work item concurrently, so you need a mechanism, probably locking, to prevent them from trying to action the same work item multiple times. And even if you've adopted some non-locking mechanism for duplicate prevention, you still have to use locking for whatever writes the new data into the structure, and that creates a bottleneck that can seriously hamper throughput as all the readers wait while the writer(s) write.

          It is almost always better to have the code that writes the information to that structure recognise when it is about to write a new work item to the LSDS, and have it queue that work item to a shared queue read by all the worker threads. This way, the workers don't waste time searching for something to do, with the attendant high probability that a worker 'just misses' a new work item on one pass, starts over, and then finds that another worker has claimed it in the meantime, wasting the whole pass.

          This scenario often leads to some workers doing nothing but poll the LSDS without ever actually finding anything useful to do. And the more workers there are, the higher the probability of that happening.

          With the work item queue, when a thread is ready to do something, it just reads the next item from the queue. If there is nothing there, it just blocks, consuming no resources until something becomes available, and ensuring that it doesn't hamper the other workers that do have useful work to do, whether by locking the LSDS or by uselessly chewing valuable cpu cycles. (See the sketch following this list.)

          And of course, once you have the writer(s) perform the work item recognition and queuing, then the need for the LSDS becomes redundant, and so you free up that memory, and avoid the overheads of locking.
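
        To make that concrete, the change amounts to something like this (a sketch; publish_item() and process() are hypothetical names standing in for your code):

            use Thread::Queue;

            my $work = Thread::Queue->new;    # visible to writer and workers alike

            # Writer side: the code that recognises a new work item queues it
            # directly, instead of burying it in the LSDS for workers to find.
            sub publish_item {
                my ( $item ) = @_;
                $work->enqueue( $item );      # Thread::Queue does its own locking
            }

            # Worker side: no scanning, no polling. dequeue blocks while the
            # queue is empty, consuming no cpu, and wakes when work arrives.
            sub worker {
                while ( defined( my $item = $work->dequeue ) ) {
                    process( $item );
                }
            }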

        Your application sounds like an interesting problem to solve, and given sufficient information to allow me to avoid having to speculate about it, I would enjoy considering the possibilities available to you.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Sharing large data structures between threads
by sundialsvc4 (Abbot) on Mar 09, 2011 at 14:49 UTC

    It still disturbs me that you have two hundred threads... but, since you have specifically excluded consideration of this aspect of the system in your OP, I suppose that it cannot be reconsidered. But if it could...

    The Windows memory-manager is known to be quite “lazy”, and sometimes this characteristic produces memory consumption that is quite a bit larger than you would intuitively suppose it should be. I also suspect that it over-states the “amount used.”

    My instincts still tell me – nay, they fairly scream at me – that you are going to have to find a way to reduce the total number of threads that are active at one time. It has consistently been my experience over many years that a limited (and fairly small) number of threads should be put to work against an arbitrary (and perhaps quite large) number of things-to-do. In the Windows environments, this seems to be particularly important.

      The Windows memory-manager is known to be quite “lazy”, and sometimes this characteristic produces memory consumption that is quite a bit larger than you would intuitively suppose it should be. I also suspect that it over-states the “amount used.”

      Documentation? Evidence? Source of this unfounded rumour?

      Why do you continue to expound crap on subjects you obviously have no f***ing idea about?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.