Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks,

I have a general implementation/design question that has been troubling me. We are setting up a website that will provide a free resource to researchers at our institution. There are three parts to this:

  1. Front end: CGI, accepts and verifies parameters
  2. Wrapper: call external program to do "data processing"
  3. Results display: CGI to format the data

The part that is troubling me is #2. The data processing can be pretty computationally intensive, so I would prefer to "queue" requests and have them processed serially rather than in parallel. It seems, though, that queuing would break CGI's underlying request/response model, since the results wouldn't be ready by the time the page has to be returned.

The best option I see is as follows:

  1. Front End
  2. Write parameters to text
  3. Check if results finished
  4. Display results

Once the parameters are written to a text file, a cron job (okay, this is on WinXP, so I guess it would have to be a service?) regularly checks the queue file for new records to run. If it finds any, it processes them and dumps the results to another text file.
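
For concreteness, here is roughly what I have in mind - a minimal sketch, with the queue path, record format, and results location as placeholders:

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    # Called from the front-end CGI: append one job record per line.
    # A unique id lets the results CGI poll for the output file later.
    sub enqueue_job {
        my (%params) = @_;
        open my $fh, '>>', 'C:/queue/jobs.txt' or die "queue: $!";
        flock $fh, LOCK_EX or die "lock: $!";
        my $id = time() . "-$$";
        print {$fh} join("\t", $id,
            map { "$_=$params{$_}" } sort keys %params), "\n";
        close $fh or die "close: $!";
        return $id;   # results CGI checks for C:/results/<id>.txt
    }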

I'm not a huge fan of the above strategy because, among other things, it forces me to rewrite the wrapper, which already exists as a standalone, non-CGI program. Are there any other standard solutions to this kind of problem?

Re: Queuing Input for Serial Processing
by bean (Monk) on Aug 25, 2003 at 18:29 UTC
    For such a resource-intensive application, one solution is to email the results to the users - emails are less ephemeral than web-pages and so your users would be less likely to waste resources by recalculating a previously generated set of results.
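
    Sending the mail is nearly a one-liner with something like Mail::Sendmail (the addresses and SMTP host below are placeholders):

        use Mail::Sendmail;

        # Mail the finished result set instead of making the user wait on a page.
        sub mail_results {
            my ($to, $results) = @_;
            sendmail(
                To      => $to,
                From    => 'jobqueue@your.institution.edu',   # placeholder
                Subject => 'Your results are ready',
                Message => $results,
                smtp    => 'mail.your.institution.edu',       # placeholder
            ) or warn "mail failed: $Mail::Sendmail::error";
        }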

    By the same token, if your data changes relatively slowly, you could cache results keyed off a parameter/date combination - then you could just use the cache if asked for a result set you've already calculated (this hinges on being able to tell if the data has changed since the result was generated).
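
    A sketch of the idea - key the cache on a digest of the sorted parameters plus the data set's modification time, so a stale entry simply never matches (paths are placeholders):

        use Digest::MD5 qw(md5_hex);

        # Returns the cache file if we already computed this result
        # against the current data, or undef on a miss.
        sub cached_results {
            my (%params) = @_;
            my $data_mtime = (stat 'C:/data/dataset.txt')[9];
            my $key = md5_hex(join '|', $data_mtime,
                              map { "$_=$params{$_}" } sort keys %params);
            my $file = "C:/cache/$key.txt";
            return -e $file ? $file : undef;
        }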

    If you cache results, sending the results as emails no longer frees up resources, but your users will probably like it better anyway. Even 15 seconds is a long time to wait for a webpage to load - whereas getting an email in 60 seconds feels fast. It's all a question of expectations - people tend to wait for webpages, but will go about their business if told the results will be sent to them. Plus, I think an email is a more useful form - if the user wants to send the results to someone else, they just forward the email, instead of cutting and pasting...

    BTW, a daemon might be better than a cron job: even if the cron job runs once a minute, you're adding an average of 30 seconds of latency on top of each job's processing time (selecting is better than polling).
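
    For instance, the daemon could block on a local socket and have the CGI hand jobs to it directly, so a job starts the moment it arrives - no polling interval at all (the port and one-line job format are placeholders, and process_job() stands in for your existing wrapper):

        use IO::Socket::INET;

        my $server = IO::Socket::INET->new(
            LocalAddr => '127.0.0.1',
            LocalPort => 9999,        # placeholder port
            Listen    => 5,
            Reuse     => 1,
        ) or die "listen: $!";

        # accept() blocks until the CGI connects, so there is no polling delay
        while (my $client = $server->accept) {
            chomp(my $job = <$client>);   # one line of parameters per job
            close $client;
            process_job($job);            # your existing wrapper, run serially
        }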
    Update
    Obviously, caching is only worth implementing if users are likely to submit the same parameter values before the data changes.
Re: Queuing Input for Serial Processing
by LameNerd (Hermit) on Aug 25, 2003 at 17:33 UTC
    I am curious as to why you would want to process things in a serial manner. I would think that, in general, processing things in parallel is better - better meaning more efficient use of computing resources. You can always process the requests in parallel yet still only release the results FIFO. In other words, you process things in parallel but make it look like they were processed serially if you have to.
    updated
    BTW you might be able to get cron to work on your Windows XP system. I have used cygwin's port of cron successfully on my NT system.

      Hiya,

      The primary reason is that three or four concurrent requests would be enough to send the computer thrashing virtual memory. Each computation takes about 700 MB of RAM and about 60 seconds of CPU time, so the machine can handle maybe two at once; with more than that, they all slow down dramatically.

        Well, if you were to develop your software so that it processes requests in parallel but limits how many can run at the same time, the maximum could be stored in a site-specific configuration variable. On your current hardware you would set it to 1, and if you ever got a better box you would only have to change that one parameter instead of rewriting your code.
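
        Parallel::ForkManager, say, makes the limit a single number (SiteConfig, read_queue(), and process_job() below are hypothetical stand-ins for your own config and wrapper):

            use strict;
            use warnings;
            use Parallel::ForkManager;
            use SiteConfig;   # hypothetical module setting $SiteConfig::MAX_JOBS

            my $pm = Parallel::ForkManager->new($SiteConfig::MAX_JOBS);  # 1 today

            my @queued_jobs = read_queue();   # however you load pending jobs
            for my $job (@queued_jobs) {
                $pm->start and next;   # parent: schedule the next job
                process_job($job);     # child: run one request
                $pm->finish;
            }
            $pm->wait_all_children;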