Thanks, this was helpful. I'm thinking about writing this myself, but there are a lot of corner cases to take care of: What if a resource goes down or is reserved by someone else? What if a job never finishes? What if I need to scale this up and support multiple "job submit nodes"? That's why I was hoping for an existing solution.
There are a lot of grid schedulers (Condor, Sun Grid Engine forks, Torque, etc.), but the problem I see with most of them is that they assume they control the actual jobs and start and stop the individual processes themselves. That's not the case in our environment, where we already have the "control job" (basically a customized version of the TAP::Parser module).