in reply to Managing a web form submission work queue

When faced with anything that looks like a pipeline, I like to split the work into many small parts and keep each one as simple as possible.

It seems to me that you have three steps:

  1. Procure the incoming data
  2. Post the data on the external forum
  3. Put the data into the processing queue for the next process

As long as performance permits it, I would use only one process per step, as multiple processes will give you the headaches of concurrency.
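Run serially in one process, the three steps above amount to a loop like the following sketch (in Python; all names and the in-memory stand-ins for the real source, forum, and queue are placeholders):

```python
incoming = ["submission-1", "submission-2"]   # stand-in for the real incoming data
posted, next_queue = [], []                   # stand-ins for the external post + next queue

def procure():
    """Step 1: take the next incoming submission, or None when idle."""
    return incoming.pop(0) if incoming else None

def post(item):
    """Step 2: post the data externally (here: just record it)."""
    posted.append(item)

def enqueue(item):
    """Step 3: hand the item to the next process's work queue."""
    next_queue.append(item)

def run_once():
    """One trip through the pipeline; one process, so no concurrency headaches."""
    item = procure()
    if item is not None:
        post(item)
        enqueue(item)
    return item
```

Because only one process runs the loop, each item passes through the steps in order and nothing needs locking.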

If you have a scheme of proper file locking (as is easily available under Win32, and not-so-easily but already demonstrated here under Unixish filesystems), you can use a separate process for each step, which makes restarting individual items much easier. Each item then becomes a file that is moved from directory to directory as it progresses through the stages of your pipeline. Status queries reduce to finding where a file resides in the directory tree, plus a check that no file is older than (say) 5 minutes in any of the intermediate directories.
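A minimal sketch of that directory-per-stage scheme (in Python; the stage names and the 5-minute threshold are just illustrative, and a rename is only atomic when both directories sit on the same filesystem):

```python
import os
import time

# Hypothetical stage directories; each pipeline step owns one.
STAGES = ["incoming", "posting", "queueing", "done"]
STALE_AFTER = 5 * 60  # seconds; the "no file older than 5 minutes" rule of thumb

def claim_next(stage):
    """Pick one file from a stage directory, or None if the stage is empty."""
    try:
        names = os.listdir(stage)
    except FileNotFoundError:
        return None
    return os.path.join(stage, names[0]) if names else None

def advance(path, next_stage):
    """Move an item to the next stage; rename is atomic on one filesystem."""
    dest = os.path.join(next_stage, os.path.basename(path))
    os.rename(path, dest)
    return dest

def stale_items(now=None):
    """Status check: files sitting longer than STALE_AFTER in any intermediate stage."""
    now = now or time.time()
    stale = []
    for stage in STAGES[:-1]:  # the final directory is allowed to accumulate
        for name in os.listdir(stage):
            path = os.path.join(stage, name)
            if now - os.path.getmtime(path) > STALE_AFTER:
                stale.append(path)
    return stale
```

Restarting a stuck item is then just moving its file back one directory by hand.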

If you have no way of proper locking through files, a database supplies you with easy concurrency and proper locking. Put your data in a table row together with a status column, and all processes can manipulate the data even more easily. I would still restrict input to one process to avoid feeding duplicates, but if you construct your SQL and the status transitions properly, you can have as many processing workers as you wish (or as your system allows). Status queries are then simple SQL, but taking an item out of the processing pipeline requires setting its status instead of moving a file - this may or may not be a problem for bulk changes, depending on how much access you have to the live system.

Re: Re: Managing a web form submission work queue
by Limbic~Region (Chancellor) on Mar 29, 2004 at 15:02 UTC
    Corion,
    I am responsible for only one piece in the pipeline - the piece I described. The process that will be filling my work queue and the process whose work queue I will be filling are beyond my control. I do have a bit of say in how my work queue gets filled, just as I have to conform to the requirements of the work queue I will be filling.

    My preliminary tests and estimated rate of incoming jobs lead me to believe single-threaded serial processing will be insufficient. That isn't to say I haven't missed something that could improve efficiency, but I really do want to keep my options open.

    I like the simplicity of one file per job and using directories as queues. I am not sure if there will be a "reporting" requirement in the future, but I can bet that if I don't plan for it, there will be. That leaves a couple of options as I see it:
    • Write a transaction log
    • Use a database where the stats can be constructed (time in, time out, number of attempts, etc)
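For the transaction-log option, something as simple as one appended line per state change would let the stats (time in, time out, number of attempts) be reconstructed later by replaying the log. A sketch - the file name and field layout are purely illustrative:

```python
import time

LOGFILE = "queue.log"  # hypothetical location

def log_event(job_id, event, attempt=1):
    """Append one tab-separated line per state change:
    timestamp, job id, event name, attempt number."""
    line = "\t".join(
        [time.strftime("%Y-%m-%d %H:%M:%S"), job_id, event, str(attempt)])
    with open(LOGFILE, "a") as fh:
        fh.write(line + "\n")
```

Pairing each job's enter/exit events gives time in queue; counting its retry events gives attempts.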
    Any code snippets will be appreciated. Thanks again.
    L~R