Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All:
I come to you seeking advice with no code and only a little bit of research. Some of you who have heard me in the CB know how much I detest coding anything WWW/HTTP/CGI/web related. My next project will push my abilities in these areas to their limits.

Overview: The code will take items off a work queue, post the information to an external web form, and put the results in an out queue for further processing (by another process). The work queue will never become empty, as new items from earlier in the assembly line will continuously be added. I am fairly sure WWW::Mechanize is one of the right tools for the job.
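To give a flavor of the submission step I have in mind, a minimal WWW::Mechanize sketch might look like this (the URL, form name, and field names below are all placeholders, not the real target):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical form URL and field names -- substitute the real ones.
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get('http://example.com/submit.cgi');

$mech->submit_form(
    form_name => 'workform',    # assumed form name
    fields    => {
        item_id => '12345',
        payload => 'data from the work queue',
    },
);

if ( $mech->success ) {
    print $mech->content;       # result goes to the out queue
}
else {
    warn "Submission failed: ", $mech->status, "\n";
}
```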

One of the first things I am not sure about is how to handle the work queue:

Ideally, I would want it to be smart enough to start up new processes when it determined it was not keeping up with the work queue, kill off workers when they were not needed, and be aware of system resources so that it would not bring the system to its knees.

Error handling is also a concern. First, I want to be sure there is a good locking mechanism so two workers do not attempt to work on the same item from the work queue. Additionally, if there is a problem with the form processing I want it to be re-tried a configurable number of times before being placed in a "bad queue".
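Something like the following is what I picture for the retry logic; process_item() and move_to_bad_queue() are placeholder hooks for whatever the real work and bad-queue handling turn out to be:

```perl
use strict;
use warnings;

my $MAX_TRIES = 3;    # configurable retry limit

sub process_with_retry {
    my ($item) = @_;
    for my $try ( 1 .. $MAX_TRIES ) {
        # process_item() is a placeholder that posts one queue item
        # and returns true on success, dying on failure.
        return 1 if eval { process_item($item) };
        warn "Attempt $try failed for item $item->{id}: $@";
        sleep 2**$try;    # simple exponential backoff between tries
    }
    # move_to_bad_queue() is a placeholder that files the item away
    # for later inspection.
    move_to_bad_queue($item);
    return 0;
}
```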

I also have no idea what my work queue should look like. Beyond a single flat file, very simplistic database, or a queue directory where each file is a work item - I don't have the first clue as to what to use.

What I am looking for is working code that addresses one or more of my concerns. I am quite capable of splicing together what I need to meet all my requirements or figuring out how to modify the code to suit my needs. I understand it is typically hard to help someone who doesn't provide sample data and desired output. It is the framework itself that is my problem - I can handle the data munging myself.

Thanks in advance for your time and consideration in this matter.
L~R


Re: Managing a web form submission work queue
by Corion (Patriarch) on Mar 29, 2004 at 14:49 UTC

    When faced with anything that looks like a pipeline process, I like to split up the parts into many small parts, and keep each one as simple as possible.

    It seems to me that you have three steps:

    1. Procure the incoming data
    2. Post the data to the external web form
    3. Put the data into the processing queue for the next process

    As long as performance permits it, I would use only one process for one step, as multiple processes will give you the headaches of concurrency.

    If you have a scheme of proper file locking (easily available under Win32, and not-so-easily, but already demonstrated here, under Unixish filesystems), you can use a separate process for each step, which makes restarting individual items much easier. Each item then becomes a file which is moved from directory to directory as it progresses through the stages of your pipeline. Status queries reduce to finding where a file resides in the directory tree, plus a check that no file sits for longer than (say) 5 minutes in any of the intermediate directories.
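    A minimal sketch of that directory-per-stage idea (the stage names are made up; rename() is atomic within one filesystem, so a job file can only ever belong to one stage at a time):

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical stage directories for the pipeline.
my %next_stage = (
    incoming => 'posting',
    posting  => 'outgoing',
);

# Move a job file to the next stage; rename is atomic on one filesystem.
sub advance {
    my ( $file, $stage ) = @_;
    my $dest = "$next_stage{$stage}/" . basename($file);
    rename $file, $dest
        or die "Could not move $file to $dest: $!";
    return $dest;
}

# Flag jobs stuck longer than 5 minutes in an intermediate stage.
sub find_stale {
    my ($dir) = @_;
    return grep { ( time - ( stat $_ )[9] ) > 300 } glob "$dir/*";
}
```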

    If you have no way of proper locking through files, a database supplies you with easy concurrency and proper locking. Put your data in a table row together with a status column, and all processes can manipulate the data even more easily. I would still restrict input to one process to avoid feeding duplicates, but if you construct your SQL and the status column properly, you can have as many processing workers as you wish/your system allows. Status queries are then simple SQL, but taking an item out of the processing pipeline requires setting the status instead of moving a file - this may or may not be a problem for bulk changes, depending on how much access you have to the live system.
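    A sketch of the status-column approach with DBI (the schema is assumed: jobs(id, payload, status), with status one of 'new', 'in-progress', 'done', 'bad'; SQLite stands in for whatever database you have):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=queue.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Claim one new job; the WHERE clause means only one worker can win,
# because the status flips to 'in-progress' for exactly one of them.
sub claim_job {
    my ($job_id) = @_;
    my $claimed = $dbh->do(
        q{UPDATE jobs SET status = 'in-progress'
          WHERE id = ? AND status = 'new'},
        undef, $job_id,
    );
    return $claimed && $claimed > 0;    # do() returns rows affected
}
```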

      Corion,
      I am responsible for only one piece in the pipeline. The piece I described. The process that will be filling my work queue and the process whose work queue I will be filling are beyond my control. I do have a bit of say as to how my work queue gets filled, just as I have to conform to the requirements of the work queue I will be filling.

      My preliminary tests and estimated rate of incoming jobs lead me to believe single-threaded serial processing will be insufficient. That isn't to say I haven't missed something where I could improve efficiency, but I really do want to keep my options open.

      I like the simplicity of 1 file per job and using directories as queues. I am not sure if there will be a "reporting" requirement in the future, but I can bet if I don't plan for it there will be. That leaves a couple of options as I see it:
      • Write a transaction log
      • Use a database where the stats can be constructed (time in, time out, number of attempts, etc)
      Any code snippets will be appreciated. Thanks again.
      L~R
Re: Managing a web form submission work queue
by perrin (Chancellor) on Mar 29, 2004 at 15:08 UTC
    It's not nearly as bad as it sounds. You can use Parallel::ForkManager to handle the parallelism. Don't get hung up on dynamically managing the number of workers -- it should be good enough to have a limit. Use a relational database to store the queue. Have a status field for each job that you can switch between "new", "in-progress", and "complete." When a worker takes a new job from the queue, change its status, and use locking to prevent two workers from grabbing the same job. When a job completes, write the result back into the database and update the status again. Hopefully that's enough to get you started.
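    A sketch of that shape using Parallel::ForkManager, with a fixed worker limit as suggested (fetch_new_jobs() and run_job() are hypothetical application hooks, not part of the module):

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Fixed worker limit -- no dynamic scaling needed.
my $pm = Parallel::ForkManager->new(5);

while ( my @jobs = fetch_new_jobs() ) {    # placeholder: pull 'new' jobs
    for my $job (@jobs) {
        $pm->start and next;    # parent: move on to the next job
        run_job($job);          # child: post the form, update status
        $pm->finish;            # child exits here
    }
    $pm->wait_all_children;     # drain this batch before the next poll
}
```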
Re: Managing a web form submission work queue
by Ovid (Cardinal) on Mar 29, 2004 at 15:15 UTC

    Aside from the other suggestions, might I recommend a serializable workflow object? You would basically draw a work flowchart with a beginning and an end. Each stage is a different "state" in the workflow with a proper commit/rollback procedure. I think Pixie might be great for something like this. Your program could load all workflow objects and either pass each object to a different thread/POE process or just load them sequentially if the load was light enough.

    For the "post to form" job, the workflow object would attempt to post the data and, if it fails, move itself to the bad queue. For that, you might want a single workflow collection object. You only need (hopefully) one instance and can use the flyweight pattern to identify the workflow objects in the various "in", "work", "out", and "bad" queues. (I'm just guessing about the flyweight pattern because I think it might make things easier if you're running multiple threads or have workflow objects on another machine.)

    Conceptually I think it's a pretty clean model and hopefully would accurately reflect your business needs. It also makes locking simple if you have a single collection object assigning tasks to the workflow objects (theoretically).
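    A minimal sketch of such a workflow object, with explicit legal transitions so a failed step can only move to a known state; persistence (via Pixie or otherwise) is deliberately left out, and the state names are just the queues mentioned above:

```perl
package Workflow::Job;
use strict;
use warnings;

# Legal state transitions for a job moving through the pipeline.
my %transitions = (
    in   => ['work'],
    work => [ 'out', 'bad' ],
);

sub new {
    my ( $class, %args ) = @_;
    return bless { state => 'in', %args }, $class;
}

# Move to a new state, dying on any transition the flowchart forbids.
sub move_to {
    my ( $self, $state ) = @_;
    die "Illegal transition $self->{state} -> $state"
        unless grep { $_ eq $state }
        @{ $transitions{ $self->{state} } || [] };
    $self->{state} = $state;
    return $self;
}

1;
```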

    Cheers,
    Ovid


Re: Managing a web form submission work queue
by matija (Priest) on Mar 29, 2004 at 14:45 UTC
    Considering you want locking in your queue (so that only one worker can get assigned a piece of work, even if there are two processes trying to assign something), I strongly recommend that you use a database for the queue.

    A way to ensure that, even if the database doesn't support atomic operations, is to have an integer field (call it lock) in your row. When you want to select that row, do a

    update table set lock=$$ where id=foo and lock=0
    Then you select that row and examine the lock. If it is $$, then your process has the lock. If it is something else, another process swooped in at the last moment, and it has the lock. If it is 0, then you have a serious problem :-).
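    That test-and-verify dance in DBI might look like the sketch below (the table and column names are assumed; note that "lock" itself is a reserved word in most SQL dialects, so the column is called locked_by here, with 0 meaning free):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=queue.db', '', '',
    { RaiseError => 1 } );

sub try_lock {
    my ($id) = @_;

    # Stamp our pid into the row, but only if nobody holds it yet.
    $dbh->do(
        q{UPDATE queue SET locked_by = ? WHERE id = ? AND locked_by = 0},
        undef, $$, $id,
    );

    # Re-read and verify: if locked_by is our pid, the lock is ours;
    # anything else means another process swooped in first.
    my ($owner) = $dbh->selectrow_array(
        q{SELECT locked_by FROM queue WHERE id = ?}, undef, $id );
    return defined $owner && $owner == $$;
}
```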

Re: Managing a web form submission work queue
by waswas-fng (Curate) on Mar 29, 2004 at 14:55 UTC
    L~R: I would think the two best forms of queues are databases and queue directories. In your case a directory system such as the one sendmail uses may be very valuable. It will let you lock and manage jobs, and you can have different levels of queue directories such as:

    /var/spool/prog/queue-1
    /var/spool/prog/queue-2
    /var/spool/prog/queue-bad
    /var/spool/prog/queue-lowpriority
    and have the queue runners on each queue manage the flow of jobs between queues. You can bulk up the number of queue runners for a particular queue if the load jumps, and set different priorities for previously failed jobs. Locking can be flock-based or child-file based (like sendmail -- flock is great for one-step queue processing; file-based is nice if the process has multiple steps and needs to be inspected after a system failure or reboot before the queue file is run again). You can do the same thing in a database, and it may make more sense to, depending on how the jobs are gathered. Either way, you know what to watch for -- you need tight file locking to avoid races.
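    A sketch of the flock-based variant for a queue runner (the queue path matches the layout above; a runner that loses the race simply skips the file):

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Try to claim a job file with a non-blocking exclusive flock;
# returns the open handle on success (keeping it open holds the lock).
sub claim_file {
    my ($file) = @_;
    open my $fh, '<', $file or return;
    if ( flock $fh, LOCK_EX | LOCK_NB ) {
        return $fh;
    }
    close $fh;
    return;    # another runner is already working on it
}

for my $job ( glob '/var/spool/prog/queue-1/*' ) {
    my $lock = claim_file($job) or next;
    # ... process $job, then move it on to the next queue ...
}
```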


    -Waswas
Re: Managing a web form submission work queue
by knowmad (Monk) on May 10, 2004 at 21:15 UTC

    I'm glad to have found this discussion as I am facing a similar issue. It seems the consensus for managing work-in-progress is via a status field in the work queue database. That's the approach I was considering so am glad to see it validated.

    However, none of the posters who suggested this method gave any input about how to recover from error conditions. How does the script know if a record marked as 'in-progress' is actually being worked, is stalled (because the user took a lunch break), or needs to be returned to the queue because of a system crash, user error or other anomaly?

    In my case, each user has an id which I plan to use to check whether there is already a record being worked by this user. It won't catch stalled records if the user doesn't return, which is not ideal. I could get around that by adding an additional expiration check after nn minutes/hours/days. Any other suggestions/comments would be most appreciated.
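    That expiration check can be a single sweep query; a sketch, assuming the jobs table records a claimed_at timestamp when a worker takes a job (schema and the 30-minute cutoff are both assumptions):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=queue.db', '', '',
    { RaiseError => 1 } );

# Anything 'in-progress' for longer than the timeout is presumed
# stalled (crash, lunch break, ...) and returned to the queue.
my $TIMEOUT   = 30 * 60;
my $reclaimed = $dbh->do(
    q{UPDATE jobs SET status = 'new'
      WHERE status = 'in-progress' AND claimed_at < ?},
    undef, time() - $TIMEOUT,
);
print "Returned $reclaimed stalled job(s) to the queue\n"
    if $reclaimed > 0;
```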

    Thanks,
    William

      knowmad,
      I ended up using Parallel::ForkManager. It offers a few callbacks that can hook into custom code to verify this. Since this is an application-specific type of verification, I think the way it was abstracted in the module was perfect for my needs.

      Hope this helps - L~R

        L ~ R,

        Thanks for your reply. I looked into Parallel::ForkManager but don't really see how you are using it to manage a work queue for a web application. Do you keep a forked process open until the user submits the form? Which callbacks are you using?

        Thanks for your feedback,
        William