Hi all,
I am banging my head against the wall on the following issue.
I have a Perl program that spawns 4 child processes and then calls waitpid to wait for them to finish their work and exit. Up until two weeks ago everything was working fine: the child processes completed their tasks, returned one by one, and then the parent process exited. However, in the last two weeks one of the child processes (which one seems to be random) dies with an "out of memory" error. The parent process is about 100MB in size (RAM) when it spawns the 4 children. Each of the four children starts at about 500MB (due to loading a lot of data in RAM) and then grows to about 650MB. However, about 5min after the parent spawns the 4 children, AND while they are all still about 500MB in size, one of the children dies, and which child it is seems to follow no particular order. The process that croaks does process some of its assigned tasks before crashing, and the data being processed is not really the problem: I tried executing all the tasks that are split among the 4 processes in a single process and it works just fine. So the data and its splitting are not the issue.
After a lot of searching and Googling, I noticed that at the time of spawning by the parent, the sum of the RAM taken by the 4 child processes and the parent process is about 2GB (4*500MB + 100MB), which happens to be the process limit on a 32-bit machine. Could this be the cause or is it just a coincidence? I can't help but notice that when there was less data and the children were about 300MB in size each, everything worked just fine. Now, after one of the children croaks, the other 3 continue working just fine, and the parent dutifully waits for all of them and then exits without error.
If the issue is the parent-child relationship and the size limit, is there a way around it? Would it help if the parent just exited and the 4 children were then adopted by init? Btw, the way I spawn the processes is just the fork() call from within Perl, so there is nothing fancy going on in memory allocation here.
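For reference, the spawn-and-wait pattern described above looks roughly like the sketch below. It is a simplified illustration, not the actual program: `run_children`, `describe_status`, and the dummy work routine are hypothetical names, and the real children load data and process tasks instead of returning immediately. It also decodes `$?` after each waitpid, so it shows whether a child exited on its own with a non-zero code or was killed by a signal.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a wait status (the value of $? after waitpid) into readable text.
sub describe_status {
    my ($status) = @_;
    my $signal = $status & 127;    # low 7 bits: terminating signal, if any
    return "killed by signal $signal" if $signal;
    return "exited with code " . ($status >> 8);
}

# Fork $n children, run $work->($slot) in each, wait for all of them,
# and return a hash mapping slot number to a description of how it ended.
sub run_children {
    my ($n, $work) = @_;
    my %slot_of;                   # pid => slot number
    for my $slot (1 .. $n) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then exit with its return value.
            exit $work->($slot);
        }
        $slot_of{$pid} = $slot;
    }
    my %result;
    while (%slot_of) {
        my $pid = waitpid(-1, 0);  # block until any child finishes
        last if $pid == -1;        # no children left to wait for
        my $status = $?;
        my $slot = delete $slot_of{$pid};
        next unless defined $slot; # not one of the children started here
        $result{$slot} = describe_status($status);
    }
    return %result;
}

# Hypothetical stand-in for the real per-child work: succeed immediately.
my %result = run_children(4, sub { return 0 });
for my $slot (sort { $a <=> $b } keys %result) {
    print "child $slot: $result{$slot}\n";
}
```

Logging the decoded status for the child that croaks may help narrow things down: a signal suggests something outside the Perl process ended it, while a non-zero exit code suggests the process terminated itself.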
Any help will be appreciated.
Thx