Proportional distribution of indivisible items

anonymized user 468275 has asked for the wisdom of the Perl Monks concerning the following question:

Given an arbitrary list of input files, of varying processing times (dependent on unpredictable factors), whose start and end times can be obtained from log files. And given a known optimum number that can be run in parallel, the challenge was to find (using a Perl for quicker "time-to-market") the distribution of files into separate lists so that each list runs in parallel and its contents run in series and the difference in execution time between different lists is minimised. Clearly this falls into the category of problem that is the title of this post.

Because most recent runs are more indicative than older ones, I used the reciprocal of (update: the age of) each finish time to find the weighted average execution time for each type of input file. A reasonable but not perfect algorithm was to then sort the list by execution time descending and at each iteration place it in the list with the least total execution time so far. This looks like an intuitive winner and indeed got about 2% away from perfection on testing, but I could see by inspection that some list entries could be swapped around to produce a slightly closer result to equality (indivisibility makes exact equality highly unlikely).

So after some research it became clear that mathematicians are still struggling with this one and the last two years has seen a frenzy of research to solve it theoretically. They seem to be trying to avoid linear programming and find an analytical solution, so far without success. My gut feeling is neural network, but before I veer off in that direction, I wonder if anyone has a better or clarified idea.

One world, one people

Comment on Proportional distribution of indivisible items

Replies are listed 'Best First'.
Re: Proportional distribution of indivisible items by roboticus (Chancellor) on Aug 14, 2018 at 16:58 UTC
Have you considered processing the files in parallel using a demand-based algorithm rather than trying to predict their runtime? For example, have X processes, and each time one of the processes finishes, it can simply consume one of the remaining files? I've had success with it in the past. I generally do it something like: `while (1) { # Are we supposed to pause or stop? if (-E "/tmp/pause") { sleep 10; next; } last if -E "/tmp/STOP"; # Find a file to process my @filist = glob(".input"); while (my $file = shift @filist) { # try to claim the file if (rename $file, "$file.working.$$") { # Successful, so go handle it process_file("$file.working.$$"); last; } # Didn't get the file (someone else might've claimed it just # as we tried to), so just loop to the next one... } }` [download] The rename is essentially to let the OS handle the task of serializing the processes: on all operating systems I generally use, renaming (on a local filesystem) is considered to be an "atomic" operation, so only one process would "win" the file, and other processes would have to go and try the next one. The advantages of demand-based sharing are: Bad runtime predictions won't cause task starvation It balances itself automatically, so it's easy to add or remove processes based on system load. ...roboticus When your only tool is a hammer, all problems look like your thumb.*	[reply] [d/l]
Re^2: Proportional distribution of indivisible items by anonymized user 468275 (Curate) on Aug 15, 2018 at 15:38 UTC
It seems an interesting idea to accept them from a queue like 5 government workers pressing the button for the next ticket number held by "customers". It would certainly mean the system is sensitive to processes completing earlier than predicted. But in this context, the first available worker at any time cannot always optimally take the largest process in the queue. For example, in the simplest case of two workers with three items left in the queue, when one becomes available 3 minutes ahead of the other and the remaining process times are estmated at 8, 5 and 4 minutes; taking the 8 minute one would leave the other worker with 12 minutes (update 3 to be free plus the 5 and 4 remaining) to your 8 whereas taking the 5 minute one would leave him 11 (3+8) minutes to your 9 (5+4). So yes, I see benefit in dynamic selection although it doesn't actually remove the problem of optimal picking (based on averaged runtimes and best fit) from what is remaining to be processed, but it does add an~~other~~ independent optimisation to the best solution. One world, one people	[reply]
Re: Proportional distribution of indivisible items by Eily (Monsignor) on Aug 14, 2018 at 15:52 UTC
I'd say partition problem (optimization variant with N partitions, ie the worst kind) rather than Knapsack, but otherwise the conclusion is pretty much the same. This means that while finding the best solution is easy (test all combinations, keep the best one), doing so in reasonable time is worth 1 million. Your "intuitive winner" looks like a pretty good approximation. For what I understand of neural networks, I don't think they are relevant for this kind of problems. This is neither extracting information (the data size of the output, ie an ordering of your elements, is the same as the input), nor a classification problem (an element isn't sent to a set according to its properties alone, but to the properties of all the elements in the input).	[reply]
Re^2: Proportional distribution of indivisible items by anonymized user 468275 (Curate) on Aug 16, 2018 at 09:27 UTC
Yes it does look like a partitioning problem but for reals not just integers. The best solution is indeed supposed to be something similar to a neural network, but so far the example solutions I found which build a tree and walk it (my earlier idea was to have N tree-walkers where N is also the number of workers when parallel processing of the jobs begin) seem to have errors in the method. My latest idea taken from various replies and munged/improved, is to start by putting the biggest five each in their own child process (that first step is obviously right though difficult to prove) and have each worker, on finishing a job, solve the remaining tree for what to take next. The dynamic solution of only 25 processes takes no time compared with the heavy jobs being performed. One world, one people	[reply]
Re^3: Proportional distribution of indivisible items by LanX (Saint) on Aug 16, 2018 at 10:54 UTC
> My latest idea taken from various replies and munged/improved, is to start by putting the biggest five each in their own child process (that first step is obviously right though difficult to prove) Probably because it's obviously wrong? 5+5+2 =12 3+3+3+3=12 > have each worker, on finishing a job, solve the remaining tree for what to take next That's the best approach if you can avoid race conditions of two workers picking the same job. > it does look like a partitioning problem but for reals not just integers. All floats are effectively only integer approximations of real numbers. So? > The best solution is indeed supposed to be something similar to a neural network, By whom? Source? Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply]
Re^4: Proportional distribution of indivisible items by anonymized user 468275 (Curate) on Aug 16, 2018 at 11:48 UTC
Re^5: Proportional distribution of indivisible items by LanX (Saint) on Aug 16, 2018 at 13:17 UTC
Some notes below your chosen depth have not been shown here
Re: Proportional distribution of indivisible items by atcroft (Abbot) on Aug 14, 2018 at 15:36 UTC
What you describe sounds like a variation on the Knapsack Problem (but I am guessing you already knew that). I have not kept up with the state-of-the-art in dealing with that problem (or any NP-complete problems, actually), but that would be where I would start for theory. In practice, if you are within a single-digit percentage of best results, I would leave it there unless your working set reaches a size where that difference becomes problematic. Hope that helps.	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.


Think about Loose Coupling
	PerlMonks