Now it seems the worst cause of bloat is the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads this uses half the available 32-bit address space...
5MB per kernel thread is hardly huge for an interpreted language.
You made it clear in your OP that you don't want to hear this, which is why I haven't responded directly until now, but your memory problems are a direct result of the architecture of your code.
Simply stated, there is no point in running 200 concurrent threads. You mention LWP, so I infer that your program is doing some sort of downloading. If you have a 100 Mb/s connection, then by running 200 concurrent threads you've given each thread roughly 0.5 Mb/s, effectively the equivalent of a dial-up connection. So, in addition to chewing up more memory, throwing threads at your download task simply means that each individual download takes longer and longer to complete.
In addition to your listed alternatives of LWP::Parallel and HTTP::Async, there is another: use a sensible number of threads in a thread-pool arrangement. For this type of IO-bound processing, that usually works out to somewhere in the order of 2x to 5x the number of cores your box has. For today's typical hardware, that means 8 to 20 threads will give you the best throughput.
Using your own numbers, 20 threads at 5 MB each needs less than 100 MB of memory, leaving at least 1.9 GB of the 2 GB user address space available for your data.
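The pool arrangement described above can be sketched with core threads and Thread::Queue. This is a minimal illustration, not your code: the pool size, the `@urls` list, and the per-URL handling are all assumed for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my $POOL_SIZE = 16;    # ~2x-5x the core count for IO-bound work

my $q = Thread::Queue->new;

# Each worker pulls URLs until the queue is ended, then exits.
sub worker {
    # Create the UA *after* the thread spawns, so it isn't cloned
    # into every thread along with the parent's data.
    my $ua = LWP::UserAgent->new;
    while ( defined( my $url = $q->dequeue ) ) {
        my $res = $ua->get($url);
        print "$url: ", $res->status_line, "\n";
    }
}

my @pool = map { threads->create( \&worker ) } 1 .. $POOL_SIZE;

my @urls = ();              # assumed: your list of downloads
$q->enqueue($_) for @urls;
$q->end;                    # workers see undef and finish up
$_->join for @pool;
```

Note that only the 16 workers ever exist, so the per-thread module copies cost ~80 MB total regardless of how many thousands of URLs pass through the queue.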
In reply to Re^2: Sharing large data structures between threads by BrowserUk,
in thread Sharing large data structures between threads by Anonymous Monk