What OS? How much RAM on the machine? Do you have swap configured?
I'm going to assume a variant of Unix, because one child dies and the rest continue. At a wild guess, you've got 2 GB of RAM and no swap, in which case you are running out of RAM. Adding more RAM or configuring swap should stop the problem, though relying on swap will slow things down a lot.
Another possible fix is to split your data into a larger number of smaller pieces, then use something like Parallel::ForkManager to process it with a fixed number of children running at any time. That will give you the parallelism you're looking for while controlling how much memory you need at any one time. Keep the size of the target pieces fixed; that way, as your dataset continues to grow, your memory needs will stay constant. A sketch of that approach follows.
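Something along these lines (untested sketch; the chunk file names and process_chunk() are placeholders for however you actually split and handle your data):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Cap concurrency so memory use stays roughly constant no
    # matter how many chunks there are.
    my $pm = Parallel::ForkManager->new(4);

    my @chunks = glob('data/chunk_*');   # however you split the data

    for my $chunk (@chunks) {
        $pm->start and next;     # parent: move on to the next chunk
        process_chunk($chunk);   # child: do the work
        $pm->finish;             # child exits, freeing its memory
    }
    $pm->wait_all_children;

    sub process_chunk {
        my ($chunk) = @_;
        # ... your real per-chunk work goes here ...
        print "processed $chunk\n";
    }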
No, RAM is not the issue. The server is an Ubuntu 8.04 box with 8 GB of RAM, the kernel is PAE-enabled, and it can see all 8 GB. Swap is also not an issue - there is 8 GB of swap.
Unfortunately I can't make the children smaller. They each load an instance of a Bayesian classifier model trained on a large data set.
The only real solution would be to write a server that loads that classifier, launch several of those servers listening on different ports, and then have the spawned children I mentioned earlier talk to those servers over sockets in a round-robin fashion. It's basically a way of offloading the heavy data processing to separate instances rather than doing it inside the child processes that sometimes crash.
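The client side of that could look roughly like this (a sketch only; it assumes classifier servers already listening on localhost ports 9001..9004 and speaking a simple line-oriented request/response protocol, which is entirely made up here):

    use strict;
    use warnings;
    use IO::Socket::INET;

    my @ports = (9001 .. 9004);   # assumed classifier server ports
    my $next  = 0;

    sub classify {
        my ($text) = @_;
        my $port = $ports[ $next++ % @ports ];   # round-robin selection
        my $sock = IO::Socket::INET->new(
            PeerAddr => 'localhost',
            PeerPort => $port,
            Proto    => 'tcp',
        ) or die "Cannot connect to classifier on port $port: $!";
        print {$sock} "$text\n";         # send one request line
        chomp(my $label = <$sock>);      # read one response line
        close $sock;
        return $label;
    }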
So to reiterate, you are not aware of any restrictions on parent-child memory allocation? Nothing related to the value of SHMMAX or similar settings?
The current value of SHMMAX is 32 MB, by the way.
I am not aware of anything like that. That isn't to say there is no such limit, just that I'm not aware of one; I am neither a sysadmin nor an expert on Linux internals. (However, from googling SHMMAX, that setting should be entirely unrelated unless you are deliberately using shared memory.)
However, one question that comes up is whether all of the children are loading the same instance of the Bayesian classifier model. If so, you can save on RAM by forking one child, having that one load the classifier model, and then having it fork itself into 4. Thanks to copy-on-write, the 4 children will then share a lot more memory. As they continue to work, some of that memory will become unshared, but it may still save you a lot overall.
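Roughly like this (a sketch, with load_classifier() and do_work() standing in for whatever your real loading and processing code is):

    use strict;
    use warnings;

    # Load the big model once; the pages are then shared copy-on-write
    # with every child forked afterwards.
    my $classifier = load_classifier();

    my @pids;
    for my $n (1 .. 4) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                  # child
            do_work($classifier, $n);
            exit 0;
        }
        push @pids, $pid;                 # parent keeps track of children
    }
    waitpid($_, 0) for @pids;

    sub load_classifier { return { model => 'stand-in for your trained classifier' } }
    sub do_work { my ($model, $n) = @_; print "child $n working\n" }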
Now why are you running out of memory? I don't know. In theory you have 16 GB of memory (RAM plus swap) available to you. However, it is possible that other things are using most of it, or that some sysadmin has set a ulimit on how much memory the user you're running as can use. Whatever the case, the behavior you describe is consistent with running out of memory at close to 2 GB.
But that is testable. You just need to create several deliberately large processes and see where they run out of RAM.
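A crude probe along these lines (runnable as-is; run two or three copies at once if you want to see the combined limit) will grow until the kernel, or a ulimit, stops it:

    use strict;
    use warnings;

    # Grow a buffer in 100 MB steps and report how far we get
    # before the allocation is killed or dies.
    my $chunk = 'x' x (100 * 1024 * 1024);
    my $buf   = '';
    my $mb    = 0;
    while (1) {
        $buf .= $chunk;
        $mb  += 100;
        print "allocated roughly ${mb} MB\n";
    }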
I would be looking for ways to make the child processes use less memory. (Are input files being slurped when they could be processed one record at a time? Is each child making unnecessary copies of its input data, e.g. by reading a whole file into a scalar then splitting into an array? Are there complex data structures where simpler storage would do? Would it make sense to use additional disk-based resources instead of in-memory data structures, e.g. dbm files or other database(-like) storage?)
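For the slurping point in particular, processing one record at a time keeps the peak footprint at roughly one record rather than the whole file. A minimal sketch (input.dat and handle_record() are placeholders):

    use strict;
    use warnings;

    open my $fh, '<', 'input.dat' or die "Cannot open input.dat: $!";
    while (my $line = <$fh>) {       # one record in memory at a time
        chomp $line;
        handle_record($line);
    }
    close $fh;

    sub handle_record {
        my ($record) = @_;
        # ... your real per-record work goes here ...
    }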
Failing that, I'd be checking whether it's really necessary to have four children running at once. What does that quantity get you that you don't get with two consecutive jobs with two children per job?
If there is just one factor that makes the difference between "it works" and "it fails", that factor is the size of the input files, and those files are only going to keep growing, then you've got a scaling problem, which is a kind of design problem. Anything that doesn't solve the design problem is just a stop-gap with a limited life-span.
Solving the design problem is a matter of figuring out how to complete the task within a finite amount of RAM, so that the process runs with a stable, consistent footprint no matter what size the input data may be.
If the problem only started two weeks ago, then something in your data has changed in the last two weeks. So, if you can, get the backups from the last four weeks and do a binary search over the data to find two (possibly adjacent) datasets, one of which "works" and one of which "doesn't work". Then analyze what is different between the two datasets.