in reply to best strategy

There are critical questions you have not answered. What operating system are you running on? (Any form of *nix would make fork work well. Windows does an emulation of fork which could make that solution significantly worse.) What are you expecting to be your performance bottleneck? (CPU? Disk? Network delays?) What kind of hardware are you working on? (Number of CPUs? Number of disks?) Are there significant initialization costs? (e.g., database connections cannot be preserved across a fork, and are expensive to create.) How much data needs to be passed around? Is there any possibility of moving this to a cluster?

For an extreme example, if you're using Windows and are expecting to bottleneck on local CPU on a 1-CPU machine, you absolutely should make this job a single, single-threaded process.

Suppose that you're bottlenecked on network delays, and there is an Oracle database connection needed per worker. Then you really want several persistent workers; a single process with multiple threads would beat constant forking.
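
To make the shape of that concrete, here is a minimal sketch in Perl using threads, Thread::Queue and DBI. The DSN, table name, and job list are made-up placeholders, not anything from your actual problem:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use DBI;

    my $n_workers = 4;
    my $queue     = Thread::Queue->new();
    $queue->enqueue(1 .. 100);                 # placeholder job list
    $queue->enqueue((undef) x $n_workers);     # one "stop" marker per worker

    my @workers = map {
        threads->create(sub {
            # Each worker opens its own connection once and keeps it, so the
            # expensive connect cost is paid per worker, not per job.
            my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                                   { RaiseError => 1, AutoCommit => 1 });
            while (defined(my $job = $queue->dequeue())) {
                my ($n) = $dbh->selectrow_array(
                    'SELECT count(*) FROM some_table WHERE id = ?',
                    undef, $job);
                # ... do something with $n ...
            }
            $dbh->disconnect;
        });
    } 1 .. $n_workers;

    $_->join for @workers;

The point is only the structure: the connections persist for the life of the workers, and the queue hands out jobs as workers become free.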

Suppose that you're bottlenecked on disk seek time, you're on a Unix system, and there are no startup costs. Then I would recommend the fork approach.
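
A bare-bones version of that approach, with a made-up file list and a hand-rolled throttle (Parallel::ForkManager on CPAN packages the same idea more conveniently), might look like this:

    use strict;
    use warnings;

    my @files       = glob('/data/input/*.dat');   # placeholder work items
    my $max_workers = 8;
    my %kids;

    sub process_file {
        my ($file) = @_;
        # ... read the file and do the real work here ...
    }

    for my $file (@files) {
        # Throttle: once $max_workers children are running, reap one first.
        delete $kids{ waitpid(-1, 0) } if keys %kids >= $max_workers;

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                # child: one unit of work, then exit
            process_file($file);
            exit 0;
        }
        $kids{$pid} = 1;                # parent: remember the child
    }
    delete $kids{ waitpid(-1, 0) } while keys %kids;   # reap the stragglers

Because startup is cheap in this scenario, there is no harm in paying the fork cost once per unit of work.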

Suppose that you're bottlenecked on CPU (I originally wrote "network round trips" here; see the update below) and there is a possibility of throwing multiple machines at the problem. Then I'd recommend neither of your approaches. Instead I'd look for a way to farm out jobs to multiple processes on multiple machines. One approach is to use a standard clustering solution. A very cheesy approach that I must admit to having used in the past is to make the job run in a webserver, and then use a load balancer to distribute requests. (Hey, I had the webservers already set up and sitting there mostly idle...) Another interesting approach is to have a database with a table of open jobs, and have workers on multiple machines query it for work. (I set up a batch processing system on this principle and it worked well. It was suggested to me by a former boss who had set up a swaps trading system on the same principle, with some of the "workers" for some types of jobs really being people.)
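
To make the job-table idea concrete, here is a hedged sketch of one worker's loop in Perl/DBI. The schema (a "jobs" table with id, payload, status and owner columns) and the connection details are invented purely for illustration:

    use strict;
    use warnings;
    use DBI;
    use Sys::Hostname;

    # Invented schema: a "jobs" table with id, payload, status
    # ('open'/'taken'/'done') and owner columns.  The DSN is a placeholder.
    my $dbh   = DBI->connect('dbi:Oracle:jobdb', 'user', 'pass',
                             { RaiseError => 1, AutoCommit => 1 });
    my $owner = hostname() . ":$$";     # identify this worker

    sub run_job {
        my ($payload) = @_;
        # ... the actual work goes here ...
    }

    while (1) {
        # selectrow_array fetches only the first row, so no LIMIT is needed.
        my ($id, $payload) = $dbh->selectrow_array(
            q{SELECT id, payload FROM jobs WHERE status = 'open'});
        last unless defined $id;        # nothing left to do

        # Claim the job; the status check in the WHERE clause means only one
        # worker can win if several grab the same row at once.
        my $claimed = $dbh->do(
            q{UPDATE jobs SET status = 'taken', owner = ?
              WHERE id = ? AND status = 'open'},
            undef, $owner, $id);
        next unless $claimed == 1;      # lost the race; pick another job

        run_job($payload);
        $dbh->do(q{UPDATE jobs SET status = 'done' WHERE id = ?}, undef, $id);
    }

Run as many copies of that loop, on as many machines, as the database and the work will bear; the jobs table is the only coordination point.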

Every one of these solutions, and more, has been used successfully. Every one has advantages and cases where it is best. Anyone who gives you an absolute answer saying that one of them is always the right way to go doesn't know what they are talking about.

I didn't really answer your question, but hopefully I gave you enough to think about that you have a better chance of coming up with the right solution for your situation. Oh, and I gave you a few more options to consider. :-)

Update: I messed up one of my examples. If you're bottlenecked on network round trips, then a single machine should be able to run enough copies to move the bottleneck to the server on the other end, in which case there is no need to complicate things with a cluster. But if CPU is your problem, then you would want to split the work up across multiple machines.

Re^2: best strategy
by libvenus (Sexton) on Aug 25, 2008 at 07:37 UTC

    What operating system are you running on?

    A Unix flavour.

    What are you expecting to be your performance bottlenecks?

    Processing speed and memory utilization.

    What kind of hardware are you working on?

    A minimum of 4 CPUs available, a maximum of 12.

    Are there significant initialization costs, and how much data needs to be passed around?

    I have to read many queries, which are in very big files (around 500 in number). The output of the queries can also be bulky, and then I need to compare them. Maximizing speed with minimum memory overhead is what I am trying to achieve.

    Is there any possibility of moving this to a cluster?

    Not sure right now...

    Well, I have received some valuable advice from various monks in the thread "Problem in Inter process Communication", though I still cannot decide.