in reply to Re: Sharing large data structures between threads
in thread Sharing large data structures between threads

Now it seems the worst cause of bloat is the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads this uses half the available 32-bit address space...

5 MB per kernel thread is hardly huge for an interpreted language.

You made it clear in your OP that you don't want to hear this, which is why I haven't responded directly earlier, but your memory problems are a direct result of the bad architecture of your code.

Simply stated, there is no point in running 200 concurrent threads. You mention LWP, so I infer that your program is doing some sort of downloading. If you have a 100 Mbit/s connection, then by running 200 concurrent threads you've effectively given each thread the equivalent of a dial-up connection (100 Mbit/s ÷ 200 ≈ 0.5 Mbit/s each). So, in addition to chewing up more memory, throwing threads at your download task simply means that each thread takes longer and longer to complete its individual download.

In addition to the alternatives you listed, LWP::Parallel and HTTP::Async, there is another: using a sensible number of threads in a thread-pool arrangement. For this type of IO-bound processing, that usually works out to somewhere in the order of 2 to 5 times the number of cores your box has. For today's typical hardware, that means 8 to 20 threads will give you the best throughput.
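For illustration, here is a minimal sketch of that arrangement using threads, Thread::Queue and one LWP::UserAgent per worker; the pool size and the response handling are placeholders for you to tune:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    my $POOL_SIZE = 16;                 # somewhere in the 2x-5x cores range
    my $Qwork = Thread::Queue->new;

    sub worker {
        my $ua = LWP::UserAgent->new;   # one agent per thread; don't share them
        while( defined( my $url = $Qwork->dequeue ) ) {
            my $resp = $ua->get( $url );
            # hand the response to your real processing here
            print "$url : ", $resp->status_line, "\n";
        }
    }

    my @pool = map { threads->create( \&worker ) } 1 .. $POOL_SIZE;

    $Qwork->enqueue( $_ ) for @ARGV;            # your list of URLs
    $Qwork->enqueue( (undef) x $POOL_SIZE );    # one 'finished' sentinel per worker
    $_->join for @pool;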

Using your own numbers, that will need less than 100 MB of memory, leaving at least 1.9 GB available for your data.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: Sharing large data structures between threads
by Anonymous Monk on Mar 08, 2011 at 12:48 UTC
    If you have a 100mb/s connection, by running 200 concurrent threads, you've effectively given each thread the equivalent of a dial-up connection.

    If all the threads were doing active work (like downloading), I would have no problems. The problems arose when some servers timed out on me. I must use long timeouts (at least 5 minutes), and I have to retry the operations without too much of a delay. This is what I have tried to counter by running a larger number of threads (100 is the maximum for now).
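    To illustrate what I mean by retrying without too much delay: if each queued work item carries its own attempt count, a timed-out fetch can simply be put back at the head of the queue for whichever worker frees up next. A rough sketch; the retry limit and handling are illustrative:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::UserAgent;

        my $MAX_TRIES = 3;                # illustrative retry limit
        my $Qwork = Thread::Queue->new;

        sub fetch_worker {
            my $ua = LWP::UserAgent->new( timeout => 300 );   # the 5-minute timeout
            while( defined( my $item = $Qwork->dequeue ) ) {
                my( $url, $tries ) = @$item;
                ++$tries;                                     # this attempt
                my $resp = $ua->get( $url );
                if( $resp->is_success or $tries >= $MAX_TRIES ) {
                    print "$url : ", $resp->status_line, " after $tries tries\n";
                }
                else {
                    # requeue at the head, ahead of any shutdown sentinels
                    $Qwork->insert( 0, [ $url, $tries ] );
                }
            }
        }

        my @pool = map { threads->create( \&fetch_worker ) } 1 .. 8;
        $Qwork->enqueue( [ $_, 0 ] ) for @ARGV;               # URLs from the command line
        $Qwork->enqueue( (undef) x @pool );                   # one sentinel per worker
        $_->join for @pool;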

    That of using a sensible number of threads in a thread pool arrangement.

    But then I cannot use LWP. In fact, I do use a thread pool at the moment; my worker threads can pick up any kind of work (an LWP fetch or a database fetch). The current worker-thread code is very straightforward because it blocks on all I/O. To do non-blocking I/O, I would have to process several work items simultaneously, which makes the code quite messy. But I understand I have to go that route.

      I must use long timeouts (at least 5 minutes), and I have to retry the operations with not too much of a delay.

      In this case I would probably use a pool of LWP::Parallel::UserAgent instances, one per core, all feeding off a common queue. That way you can potentially make full use of both your cores and your bandwidth, whatever combination of instantaneous loads your program finds itself having to deal with.

      I fully appreciate your preference for the simple linear flow of a blocking-IO architecture, but LWP::Parallel::UserAgent's three callbacks make for a reasonably sane and manageable alternative.
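      Roughly, the callback style looks like this, using the subclass-and-override approach the module documents; check the module's docs for the exact signatures, and note the response handling here is illustrative:

          package MyUA;
          use strict;
          use warnings;
          use parent 'LWP::Parallel::UserAgent';

          # The three overridable callbacks.
          sub on_connect {
              my( $self, $request, $response, $entry ) = @_;
              print "Connecting: ", $request->url, "\n";
          }

          sub on_failure {
              my( $self, $request, $response, $entry ) = @_;
              warn "Failed: ", $request->url, " : ", $response->status_line, "\n"
                  if $response;
          }

          sub on_return {
              my( $self, $request, $response, $entry ) = @_;
              # dispatch the completed response; handling here is illustrative
              print $request->url, " : ", length( $response->content ), " bytes\n"
                  if $response->is_success;
          }

          package main;
          use HTTP::Request;

          my $pua = MyUA->new;
          $pua->register( HTTP::Request->new( GET => $_ ) ) for @ARGV;
          my $entries = $pua->wait( 300 );    # block up to 300s for all requests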

      A couple of speculations from reading between the lines of your posts:

      • It sounds like you're mixing DB fetches and LWP fetches in the same thread routines?

        If so, I'd suggest that you seriously consider separating those two pieces of functionality.

      • It sounds like you have your workers scanning (polling) your large, shared data structure looking for work?

        This is almost always a bad idea. (Indeed, large shared data structures are usually a bad idea full stop, but I digress.)

        In order for there to be a work item in your large shared data structure (LSDS), something has to write it there. And then your workers have to trundle around that LSDS re-reading the same data over and over, trying to recognise what is new, and which of the new items are as-yet-unactioned work items.

        And of course, as you have many workers, you also have to deal with the possibility that two or more workers happen across the same new work item concurrently, so you need a mechanism, probably locking, to prevent them from actioning the same work item multiple times. And even if you adopt some non-locking scheme for duplicate prevention, you still need locking around whatever writes the new data into the structure, and that creates a bottleneck that can seriously hamper throughput: all the readers wait while the writer(s) write.

        It is almost always better to have the code that writes information into that structure recognise when it is about to add a new work item, and have it queue that item to a shared queue read by all the worker threads (see the sketch after this list). That way the workers don't waste time searching for something to do, with the attendant high probability that a worker 'just misses' a new item on one pass, starts over, and then finds another worker got there first, wasting yet another pass.

        This scenario often leads to some workers doing nothing but polling the LSDS without ever finding anything useful to do. And the more workers there are, the higher the probability of that happening.

        With a work-item queue, when a thread is ready to do something it just reads the next item from the queue. If nothing is there, it blocks, consuming no resources until something becomes available, and it doesn't hamper the workers that do have useful work to do, whether by locking the LSDS or by uselessly chewing valuable CPU cycles.

        And of course, once the writer(s) perform the work-item recognition and queuing, the LSDS itself becomes redundant, so you free up that memory and avoid the locking overhead entirely.
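        A minimal sketch of that shape: the writer queues each work item as it creates it, and the workers simply block on the queue. The stand-in items and processing are illustrative:

            use strict;
            use warnings;
            use threads;
            use Thread::Queue;

            my $WORKERS = 8;
            my $Qwork   = Thread::Queue->new;

            # Workers block on dequeue: no polling, no locking of a shared structure.
            my @pool = map {
                threads->create( sub {
                    while( defined( my $item = $Qwork->dequeue ) ) {
                        # your real processing goes here
                        print "worker ", threads->tid, " did: $item\n";
                    }
                } );
            } 1 .. $WORKERS;

            # The writer recognises each new work item as it creates it and
            # queues it directly; nothing ever scans a shared structure.
            $Qwork->enqueue( "item$_" ) for 1 .. 100;    # stand-in for your writer(s)

            $Qwork->enqueue( (undef) x $WORKERS );       # one sentinel per worker
            $_->join for @pool;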

      Your application sounds like an interesting problem to solve, and given sufficient information to allow me to avoid having to speculate about it, I would enjoy considering the possibilities available to you.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.