Looking for a resource management / job queue module

elTriberium has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

This question is not completely Perl-focused, as I would be fine with a non-Perl solution, too. However, since our whole setup is currently based on Perl, a Perl-based solution would be ideal.

We currently have an environment where we run automated test cases on a combination of Linux servers (S), Linux clients (L) and Windows clients (W). Each individual test might require 0, 1 or more of S, L and W. We have to guarantee that no other tests will run on the same systems at the same time.

Currently we have pre-allocated testbeds for several test runs (nightly regression tests and such), but that leaves these systems idle for a long time (a nightly run typically takes ~12 hours, so the systems are idle for the other 12 hours every day). Also, it's making it hard for users to just quickly run some tests as they need a dedicated testbed.

So what we're looking for is a system that can manage these (S, L, W) resources and reserve them for the individual test runs. It should also support a job queue, so that a job is queued when the required resources are not currently available.

Once the resources are available, it should "reserve" them (or just mark them as reserved in a database) and provide them to our in-house Perl-based tool that launches the tests on them (our tool connects via SSH and Telnet, so that part doesn't need to be included). Once our tool finishes, the system I'm looking for should mark the resources as "unreserved" and put them back into the resource pool.

We already looked at many existing solutions, but most don't look completely sufficient:

TheSchwartz CPAN module: Doesn't seem to provide a resource reservation system
POE::Component::JobQueue: Looks like it only supports individual workers (clients), but not a combination of (S, L, W) as mentioned above
Condor and similar Grid management tools: Seem like overkill and also they expect to run the actual jobs on the individual nodes, which is not our use case

My question is if something like this is already available? Is anybody doing something similar? I don't expect it to be such an uncommon environment where we have multiple systems and need to reserve them to run tests on them. If there are multiple modules that I can combine (e.g. TheSchwartz and some resource manager) then that would also be fine. I'd appreciate any help!


Re: Looking for a resource management / job queue module by Marshall (Canon) on Jul 21, 2012 at 15:33 UTC
This could get to be pretty complicated if you require a fancy scheduling algorithm - but it could be that something fairly simple would work well enough to get started. It sounds like you already have the concept of a central "cop" program that starts these various tests and you need a resource manager for it to keep track of the resources.
You could use a DB to track resources, but one common way is to create a series of zero or one byte files, each file representing one of the resources. A resource is free if the "cop" program can acquire an exclusive lock (write lock) on its file, and it counts as in use for as long as that lock is held. Release the lock when the test is over. If the "cop" program dies, all the locks are released (a file lock is a memory resident structure - not something on the disk). This way you don't have to clean up a DB on a restart. Your Perl program keeps a table of who is using what.
The hard bit is, say, test1 uses a couple of resources, test2 needs them all, and test3 uses a couple of the resources (although different ones than test1). If you want test1 and test3 to run in parallel, and then run test2, that requires "more smarts" than just running down the queue sequentially and waiting until resources are available for the next test. If the queue order was different (test1 test3 test2) then a simple algorithm would run test1 and test3 together and then run test2 once both had finished. How "smart" the scheduler needs to be depends upon the job mix and other factors (like how important maximal efficiency is and how long these various tests run). Maybe some of the tests that only need a couple of resources run a long time and the one that needs them all is fast - I don't know.
Sorry if this wasn't much help, but maybe you will get some ideas. You could "roll your own" simple manager and just see how well (or not well) it works out in practice. The job queue could just be a "drop directory" with files that describe the jobs. Try FIFO first and see how it works out. Increase complexity as needed.
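A minimal sketch of the lock-file idea, assuming one empty file per resource in a directory the "cop" program owns. The names try_lock_resource, release_resource and the resource name S1 are made up for illustration, and a temp directory stands in for the real location. Note that flock locks are advisory and their behaviour on network filesystems varies, so treat this as a starting point rather than a finished resource manager.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempdir);

# Try to reserve one resource by locking its file.
# Returns the open filehandle on success (keep it for as long as the
# resource is in use), or undef if someone else holds the lock.
sub try_lock_resource {
    my ($dir, $name) = @_;
    # '>>' creates the file if it is missing and never truncates it.
    open(my $fh, '>>', "$dir/$name")
        or die "Cannot open $dir/$name: $!";
    # LOCK_NB makes flock return false at once instead of blocking.
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# Give the resource back. Closing the handle drops the lock, which is
# also what happens automatically if the process dies.
sub release_resource {
    my ($fh) = @_;
    close($fh) or warn "close failed: $!";
    return;
}

# Example run against a throwaway directory (hypothetical resource S1).
my $dir  = tempdir(CLEANUP => 1);
my $held = try_lock_resource($dir, 'S1');
print defined $held ? "S1 reserved\n" : "S1 busy\n";
release_resource($held) if defined $held;
```

The "table of who is using what" could then simply be a hash from resource name to the filehandle (and job name) returned by try_lock_resource.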
Sorry that I am not aware of a CPAN module that would do this all - but that doesn't mean that such a thing doesn't exist! Maybe there is some way that a simple resource check ("enough resources now, y/n?") can be combined with an existing module. I presume that would have the effect of running jobs that require fewer resources at a "higher priority" than ones that require more? Anyway, I recommend starting simple and measuring how well it works. "Reserving" some of the resources in advance without being able to acquire all the resources at the same time can lead to "deadlocks". Sorry if I wasn't more help. The general problem of maximally efficient use of resources is difficult (at least for me). But I am hoping that something simple will "move the ball forward" and perhaps even allow developers to inject other tests into the nightly run's mix of regression tests (software folks are known "night owls").
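To make the "enough resources now, y/n?" check and the deadlock remark concrete, here is a rough sketch that tracks free resources as counts per type and reserves either everything a job needs or nothing. The pool sizes, the job definitions and the names can_run, reserve, release and schedule_pass are all assumptions for illustration, modelled on the test1/test2/test3 example above. It contrasts strict FIFO with a "skip-ahead" pass that lets later jobs start when the job at the head of the queue is blocked.

```perl
use strict;
use warnings;

# Hypothetical pool: number of free systems per type.
my %pool = (S => 2, L => 1, W => 1);

# Jobs in queue order: test2 needs everything, test1 and test3 do not.
my @jobs = (
    { name => 'test1', need => { S => 1, L => 1 } },
    { name => 'test2', need => { S => 2, L => 1, W => 1 } },
    { name => 'test3', need => { S => 1, W => 1 } },
);

# "Enough resources now, y/n?"
sub can_run {
    my ($job, $free) = @_;
    for my $type (keys %{ $job->{need} }) {
        return 0 if ($free->{$type} // 0) < $job->{need}{$type};
    }
    return 1;
}

# All-or-nothing: never hold part of what a job needs, so two jobs
# cannot each sit on half of the other's resources.
sub reserve {
    my ($job, $free) = @_;
    return 0 unless can_run($job, $free);
    $free->{$_} -= $job->{need}{$_} for keys %{ $job->{need} };
    return 1;
}

# Put a finished job's resources back into the pool.
sub release {
    my ($job, $free) = @_;
    $free->{$_} += $job->{need}{$_} for keys %{ $job->{need} };
    return;
}

# One pass over the queue. Strict FIFO stops at the first job that
# cannot start; with $skip_ahead, later jobs may still be started.
# Jobs that did not start stay in the queue, in their original order.
sub schedule_pass {
    my ($queue, $free, $skip_ahead) = @_;
    my (@started, @waiting);
    my $blocked = 0;
    for my $job (@$queue) {
        if ((!$blocked || $skip_ahead) && reserve($job, $free)) {
            push @started, $job->{name};
        }
        else {
            push @waiting, $job;
            $blocked = 1;
        }
    }
    @$queue = @waiting;
    return @started;
}

for my $skip_ahead (0, 1) {
    my %free    = %pool;    # fresh copy of the pool for each run
    my @queue   = @jobs;
    my @started = schedule_pass(\@queue, \%free, $skip_ahead);
    printf "%-10s starts: %s\n",
        $skip_ahead ? 'skip-ahead' : 'FIFO', "@started";
}
# Prints:
# FIFO       starts: test1
# skip-ahead starts: test1 test3
```

As presumed above, skip-ahead effectively gives small jobs priority: if small jobs keep arriving, test2 may never find the whole pool free, so a real scheduler would probably need some limit on how often a job can be passed over.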
Re^2: Looking for a resource management / job queue module by elTriberium (Friar) on Jul 23, 2012 at 17:47 UTC
Thanks, this was helpful. I'm thinking about writing this myself, but there are a lot of corner cases to take care of (what if a resource goes down / is reserved by someone else? What if a job never finishes? What if I need to scale this up and support multiple "job submit nodes"?) That's why I was hoping for an existing solution. There are a lot of Grid schedulers (Condor, Sun Grid Engine forks, Torque, etc.), but the problem I see with most of them is that they operate under the assumption that they control the actual jobs and start / stop the individual processes. That's not the case in our environment where we already have the "control job" (basically a customized version of the TAP::Parser module).
Re^2: Looking for a resource management / job queue module by renormalist (Sexton) on Oct 23, 2012 at 14:14 UTC
It sounds like the perfect use-case for Tapper. There we have a scheduler that maintains HOSTS and QUEUES. Queues usually mean a test use-case (like "linux-stable", "linux-rc", etc.). You put test requests into a queue, including a spec of "requested host features", and let the scheduler decide which queue to serve next based on queue bandwidths and available hosts. A test request can "re-queue itself" to create a continuous rotation of the use-cases. Setting up Tapper with all features (as used in the OSRC, where we set up machines from scratch with other distributions and Xen/KVM setups) can be a bit tricky, but you seem to be ok with using ssh. See http://renormalist.net/misc/ for public material about it. Tell me if you already found another solution. Otherwise I could help you set up a Tapper instance step by step.
Re: Looking for a resource management / job queue module by Anonymous Monk on Jul 21, 2012 at 16:55 UTC
This is a finite-domain problem. Look at GNU Prolog...
