Re: Bag uniform distribution algorithms
by BrowserUk (Patriarch) on Apr 25, 2013 at 16:45 UTC
#! perl -slw
use strict;
use Data::Dump qw[ pp ];

sub gen {
    my $href = shift;

    # Keys sorted by descending frequency; seed the distribution
    # with the most frequent element.
    my @kByV = sort { $href->{ $b } <=> $href->{ $a } } keys %$href;
    my @dist = ( $kByV[0] ) x $href->{ $kByV[0] };
    shift @kByV;

    # Splice each remaining element into the list at evenly spaced
    # positions, working from the back so that earlier insertions
    # don't shift the later target offsets.
    while( @kByV ) {
        my $k = shift @kByV;
        my $v = $href->{ $k };
        my $n = int( @dist / ( $v + 1 ) );
        my $p = $n * $v;
        splice( @dist, $p, 0, $k ), $p -= $n for reverse 1 .. $v;
    }
    return \@dist;
}

pp gen( { A => 4, B => 2, C => 3, D => 1 } );
pp gen( { A => 5, B => 4, C => 3, D => 2, E => 1 } );
Outputs:

C:\test>1030688.pl
["A", "C", "B", "A", "D", "C", "B", "A", "C", "A"]
["A", "B", "C", "A", "D", "B", "C", "E", "A", "B", "D", "C", "A", "B", "A"]
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Bag uniform distribution algorithms
by kennethk (Abbot) on Apr 25, 2013 at 20:41 UTC
After mulling this one over for a while, in particular thinking about how to track when a letter is due to be output again and how to maintain the distribution, I think I have something good. Compute a rolling score for each element: a score accumulates in proportion to its element's output probability and gets dinged when its element is selected, and the maximum score always gets chosen.
use strict;
use warnings;
use List::Util 'sum';
use Data::Dump qw[ pp ];

sub gen {
    my $href  = shift;
    my %score = %$href;
    my $norm  = sum values %score;
    my @series;
    for (1 .. $norm) {
        # Select the element with the highest current score...
        my ($max, $elem) = 0;
        for (keys %score) {
            ($max, $elem) = ($score{$_}, $_) if $score{$_} >= $max;
            # ...while crediting every element by its weight.
            $score{$_} += $href->{$_};
        }
        push @series, $elem;
        # Ding the selected element so it has to earn its way back.
        $score{$elem} -= $norm;
    }
    return \@series;
}

pp gen( { A => 4, B => 2, C => 3, D => 1 } );
pp gen( { A => 5, B => 4, C => 3, D => 2, E => 1 } );
outputs
["A", "C", "B", "A", "D", "C", "A", "B", "C", "A"]
["A" .. "D", "A", "B", "E", "A", "C", "B", "A", "D", "C", "B", "A"]
It also naturally extends to infinite series and to non-integer element counts. The infinite-series case is why it's necessary to have a >= in the score comparison; alternatively, you could initialize $max to something negative, since the sum of all scores is necessarily zero at all times. Complexity is N*M, where N is the length of the desired series and M is the number of distinct elements.
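For the infinite case, the selection loop wraps naturally in a closure that yields one element per call. A minimal sketch of that (the make_stream name is mine, not part of the post above):

use strict;
use warnings;

sub make_stream {
    my $href  = shift;
    my %score = %$href;
    my $norm  = 0;
    $norm += $_ for values %score;
    return sub {
        my ($max, $elem);
        for ( keys %score ) {
            ($max, $elem) = ($score{$_}, $_)
                if !defined $max || $score{$_} >= $max;
            $score{$_} += $href->{$_};    # credit every element
        }
        $score{$elem} -= $norm;           # ding the one selected
        return $elem;
    };
}

my $next = make_stream( { A => 4, B => 2, C => 3, D => 1 } );
print $next->() for 1 .. 25;    # first 25 elements of the infinite series
print "\n";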
This approach also lends itself to designing a metric for how smooth a series is:
use strict;
use warnings;
use List::Util 'sum';

sub measure {
    my @series = @_;

    # Normalized frequency of each element.
    my %count;
    $count{$_}++ for @series;
    my $norm = sum values %count;
    $_ /= $norm for values %count;

    # Accumulate the squared scores at every step; a smoother series
    # keeps all scores closer to zero.
    my %score  = %count;
    my $metric = 0;
    for my $elem (@series) {
        $score{$elem} -= 1;
        for (keys %score) {
            $metric += $score{$_}**2;
            $score{$_} += $count{$_};
        }
    }
    return $metric / @series;
}

print measure("A", "C", "B", "A", "D", "C", "A", "B", "C", "A"), "\n";
print measure("A", "C", "B", "A", "C", "D", "A", "B", "C", "A"), "\n";
print measure("A", "A", "A", "A", "B", "B", "C", "C", "C", "D"), "\n";
outputs
0.37
0.37
3.45
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re: Bag uniform distribution algorithms
by BrowserUk (Patriarch) on Apr 25, 2013 at 21:17 UTC
Is there a generalized solution that minimizes error if the closed list is converted to an infinite list?
Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?
What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.
Using your example input: until the list reaches a length of 10, adding an 'e' means that 'e's are over-represented; but waiting until the 10th take to add the 'e' means that, if the list stops there, the 'e's aren't "uniformly distributed". At least, not in the sense your post implies, whereby a single letter should intuitively appear somewhere close to the middle of the list. There is no way to maintain that definition of "uniform distribution" whilst generating a list one element at a time. (Not even if you knew the final target length up front.) You would, and at best could only, achieve that definition of uniform distribution every M elements, where M == the sum of all the frequencies.
If that is acceptable, you might generate a single natural-length, uniformly distributed list internally, and then return that one element at a time, cyclically. The distribution will only be perfect every M takes, but it will never be grossly wrong, which meets the "minimizes error" requirement.
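For illustration, a minimal sketch of that cyclic approach (make_cyclic is an illustrative name; it assumes the gen() sub from the first reply above is in scope):

sub make_cyclic {
    my $dist = gen( shift );    # build the natural-length list once
    my $i    = 0;
    return sub { $dist->[ $i++ % @$dist ] };
}

my $take = make_cyclic( { A => 4, B => 2, C => 3, D => 1 } );
print $take->() for 1 .. 23;    # distribution is exact at every 10th take
print "\n";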
Given the nature of the input, how are you seeking to convert that to a specification of an infinite list?
What I mean to say is that there is a fundamental conflict between "uniform distribution" and a variable length list.
This is also what came to my mind when I read the specification.
The problem is somewhat similar to data compression algorithms: some work on complete files and can thus make a full statistical analysis of the data before starting to encode, while others have to work on the fly with data coming over a network, for example.

I guess one way to do that is a sliding-window mechanism, i.e. you reorganize data within a sliding window of a certain size; whatever has left the window can no longer be rearranged against the new data coming in. The final result is usually not as good as if the full data had been there from the onset, but a heuristic can still come relatively close to optimal (i.e. relatively similar to what a perfect algorithm would have done with prior knowledge of the full data set).

This will work in most usual cases, but it is probably also possible to manufacture a deviant data set where the heuristic fails to produce good results, just as, for a given compression algorithm, it is almost always possible to produce data whose compressed result takes more space than the original, unless the algorithm has an "oops, back to the original data" clause. And the size of the window might have a considerable effect on how successful the heuristic is. I guess only actual tests with real data can tell; it does not look as if a formal analysis can answer this question, unless perhaps we have in-depth knowledge of the data coming in.
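To make that concrete, a rough sketch of one such window heuristic (window_stream is a hypothetical name, and this is an illustration of the idea rather than a tuned algorithm): keep the last $w emitted elements, and always emit whichever element is currently furthest below its quota within that window.

use strict;
use warnings;

sub window_stream {
    my ($href, $w) = @_;
    my $total = 0;
    $total += $_ for values %$href;
    my @window;    # the last $w elements emitted
    return sub {
        my ($best, $deficit);
        for my $k ( sort keys %$href ) {
            my $seen  = grep { $_ eq $k } @window;
            my $quota = $href->{$k} / $total * @window;
            my $d     = $quota - $seen;    # how far below quota
            ($best, $deficit) = ($k, $d)
                if !defined $deficit || $d > $deficit;
        }
        push @window, $best;
        shift @window if @window > $w;
        return $best;
    };
}

my $next = window_stream( { A => 4, B => 2, C => 3, D => 1 }, 10 );
print $next->() for 1 .. 20;
print "\n";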
In short, yes. It's incumbent on the "user" either to know that if there are 20 items being distributed, a fair distribution can only occur at ( $n % 20 ) == 0, or to be OK with modulo bias. Likewise, in the case of an infinite stream, the user should either draw multiples of the size of the input list, or be OK with the fact that, as $n approaches infinity, the modulo bias fades into irrelevancy.
I'm also assuming that the input lists are finite in size, so the frequency can be known.
Re: Bag uniform distribution algorithms
by kcott (Archbishop) on Apr 26, 2013 at 09:02 UTC
G'day Dave,
By alternately splicing elements from either end of the distribution, I get your original 'qw( A C B A D C A B C A )'.
$ perl -Mstrict -Mwarnings -E '
    my %bag = ( A => 4, B => 2, C => 3, D => 1 );
    my @distribution;
    for my $key (sort { $bag{$b} <=> $bag{$a} } keys %bag) {
        my $base_offset = int(@distribution / ($bag{$key} + 1));
        my $offset = $base_offset;
        for (1 .. $bag{$key}) {
            next unless $_ % 2;
            splice @distribution, $offset, 0, $key;
            if ($_ < $bag{$key}) {
                splice @distribution, -$offset, 0, $key;
            }
            $offset += $base_offset + 1;
        }
    }
    say "@distribution";
'
A C B A D C A B C A
If I round the $base_offset value up, the middle two elements are swapped but everything else remains the same. I don't see this as being more or less uniform, but maybe it's more correct.
$ perl -Mstrict -Mwarnings -E '
    my %bag = ( A => 4, B => 2, C => 3, D => 1 );
    my @distribution;
    for my $key (sort { $bag{$b} <=> $bag{$a} } keys %bag) {
        my $base_offset = int(@distribution / ($bag{$key} + 1) + 0.5);
        my $offset = $base_offset;
        for (1 .. $bag{$key}) {
            next unless $_ % 2;
            splice @distribution, $offset, 0, $key;
            if ($_ < $bag{$key}) {
                splice @distribution, -$offset, 0, $key;
            }
            $offset += $base_offset + 1;
        }
    }
    say "@distribution";
'
A C B A C D A B C A
I haven't done extensive testing on this solution. Beyond your sample input, I tried:
my %bag = ( A => 4, B => 2, C => 3, D => 1, Z => 1 );
gives: A C B A C D Z A B C A; and
my %bag = ( A => 4, B => 2, C => 3, D => 1, Y => 5, Z => 1 );
gives: Y A C Y B A C Z D Y A B Y C A Y; and
my %bag = ( A => 4, B => 2, C => 3, D => 1, X => 6, Y => 5, Z => 1 );
gives: X Y A X C Y B A X Y Z D C X A B Y C X A Y X; and
my %bag = ( A => 4, B => 2, C => 3, D => 1, W => 4, X => 6, Y => 5, Z => 1 );
gives: X Y A W X C Y B A W X Y Z D C X W A B Y C X W A Y X
The above 4 tests all included rounding; without rounding, the results are:
A C B A Z D C A B C A
Y A C Y B A C D Z Y A B Y C A Y
X Y A C X Y B A C X D Z Y X A B Y X C A Y X
X Y A W C X Y B A W C X D Z Y X W A B Y X C W A Y X
Re: Bag uniform distribution algorithms
by hdb (Monsignor) on Apr 26, 2013 at 09:16 UTC
Evaluation is key for this question. In addition to kennethk's score function, I think the cumulative variance of positions is an appropriate measure. For the variance, a larger value is better.
use strict;
use warnings;
use Math::Round;
use Statistics::Basic qw(:all);
use List::Util 'sum';

sub dispersion {
    my %p;
    my $pos = 0;
    push @{$p{$_}}, $pos++ while $_ = shift;
    my $var = 0;
    $var += variance( $p{$_} ) for ( keys %p );
    return $var;
}

sub measure {
    my @series = @_;
    my %count;
    $count{$_}++ for @series;
    my $norm = sum values %count;
    $_ /= $norm for values %count;
    my %score  = %count;
    my $metric = 0;
    for my $elem (@series) {
        $score{$elem} -= 1;
        for (keys %score) {
            $metric += $score{$_}**2;
            $score{$_} += $count{$_};
        }
    }
    return $metric / @series;
}

sub evaluate {
    my $cand = shift;
    my $bags = shift;
    printf "\n%20s %15s %15s %s\n", "Monk", "Better:higher", "Better:lower", "Result";
    print "------------------------------------------------------------\n\n";
    for my $bag (@$bags) {
        print "bag: ";
        print join ", ", map { +"$_ => " . $bag->{$_} } sort keys %$bag;
        print "\n\n";
        for my $monk ( sort keys %$cand ) {
            my $result = $cand->{$monk}->($bag);
            my $disp   = dispersion( @$result );
            my $score  = measure( @$result );
            printf "%20s %15.2f %15.2f %s\n", $monk, $disp, $score, join "", @$result;
        }
        print "\n";
    }
}

my $bags = [
    { A => 4, B => 2, C => 3, D => 1, },
    { A => 4, B => 2, C => 3, D => 1, F => 2, G => 4, H => 2, I => 30 },
    { A => 10, B => 10, C => 10, },
];

my $candidates = { # each sub expects a hash ref and returns an array ref
    kennethk  => \&gen,
    BrowserUK => \&genUK,
    kcott     => \&kcott,
    hdb       => \&mix,
};

evaluate( $candidates, $bags );
which gives the following output
Re: Bag uniform distribution algorithms
by Laurent_R (Canon) on Apr 25, 2013 at 21:23 UTC
What class of problem is this? I thought maybe bin packing, but the bins are size one, and bin packing doesn't concern itself with uniformity of frequency.
Maybe it could be classified as a special case of the "partition problem" or "treasure partitioning problem", although there are some significant differences from the classical description of the partition problem.
Re: Bag uniform distribution algorithms
by LanX (Saint) on Oct 27, 2013 at 21:38 UTC
I'm not sure how "as uniformly distributed as possible" can be qualified ...
Do you have a test-code to check the "quality" of a solution?
But sorting according to a weighting function gives results similar to those you showed.
DB<171> %h=( A => 4, B => 2, C => 3, D => 1 )
=> ("A", 4, "B", 2, "C", 3, "D", 1)

DB<172> $sum=0; $sum+=$_ for values %h;
=> ""

DB<173> @list = map { my ($k,$v) = ($_,$h{$_}); my $int = $sum/$v; map { [ $k => $int*($_-.5) ] } 1..$v } keys %h
=> (
    ["A", "1.25"],
    ["A", "3.75"],
    ["A", "6.25"],
    ["A", "8.75"],
    ["D", 5],
    ["C", "1.66666666666667"],
    ["C", 5],
    ["C", "8.33333333333333"],
    ["B", "2.5"],
    ["B", "7.5"],
)

DB<174> map {$_->[0]} sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] } @list
=> ("A", "C", "B", "A", "C", "D", "A", "B", "C", "A")
Changing the weighting function would also allow repeating the pattern in such a way that concatenated sequences stay equally distributed (that is, an A never neighbors another A).
An iterator-version shouldn't be too difficult.
Cheers Rolf
( addicted to the Perl Programming Language)
Update: code simplified.

Update: well, on second thought, it's quite easy to find input where this approach fails ... never mind! :(
Re: Bag uniform distribution algorithms
by LanX (Saint) on Oct 28, 2013 at 17:09 UTC
I'd say this is a (discrete) optimization problem; which kind depends on the distance function to minimize (i.e. the distance from an idealized optimal solution).

One possibility is to measure the integer distances between the elements of one bag and sum up the deltas from the idealized real-number distances. You could also sum up the squared deltas and take the square root (Norm_(mathematics)#Euclidean_norm). Which "norm" to take depends on your intuitive understanding of "uniformity".

I think you can get very good results with heuristic approaches involving some random elements, but without a guarantee of optimality.
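For illustration, a sketch of one such distance function (my construction, following the idea above): compare each element's actual positions against the idealized real-valued positions N/v * (i - 0.5), and take the Euclidean norm of the deltas; smaller means "more uniform".

use strict;
use warnings;

sub distance {
    my @series = @_;

    # Collect the 1-based positions of each element.
    my %pos;
    push @{ $pos{ $series[$_] } }, $_ + 1 for 0 .. $#series;

    my $n  = @series;
    my $sq = 0;
    for my $k ( keys %pos ) {
        my $v     = @{ $pos{$k} };
        my @ideal = map { $n / $v * ( $_ - 0.5 ) } 1 .. $v;
        $sq += ( $pos{$k}[$_] - $ideal[$_] )**2 for 0 .. $v - 1;
    }
    return sqrt $sq;
}

printf "%.3f\n", distance( qw( A C B A D C A B C A ) );   # smooth:  ~2.08
printf "%.3f\n", distance( qw( A A A A B B C C C D ) );   # clumped: ~10.37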
Maybe of interest: the following algorithm will calculate all >12000 arrangements of your bag; you can use the output to test different distance functions (or norms) to refine your understanding of "uniformly distributed".

Please note the flag $MODULO_ROTATION, which limits the output to the subset of solutions from which all the others can be generated by rotation; this might facilitate calculation of the distance.
use v5.10.0;
use warnings;
use strict;
use Data::Dump qw/pp/;

#my @sets = ( ["A".."C"], [("-") x 3] );

my (@path, @results);
my %bag  = ( A => 4, B => 2, C => 3, D => 1 );
my @sets = ();

my $MODULO_ROTATION = 1;
if ($MODULO_ROTATION) {
    # Fix the single D as the first element: every other arrangement
    # is a rotation of one that starts with D.
    delete $bag{D};
    push @path, "D";
}

push @sets, [ ($_) x $bag{$_} ] for keys %bag;

sub branch {
    my $done = 1;
    for my $set (@sets) {
        if (@$set) {
            $done = 0;
            push @path, shift @$set;
            branch();
            unshift @$set, pop @path;
        }
    }
    push @results, join "", @path if $done;
}

branch();
pp \@results;
Of course, you could also combine this slow branching approach with a distance function that avoids walking through inefficient sub-trees, giving a branch-and-bound solution (i.e. bound whenever the distance so far already exceeds the known local minimum)...
But I doubt you would want to use this in practice...
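A rough sketch of that bound (my addition, reusing the @path and @sets globals from the script above; the "distance" used here is simply the number of adjacent identical letters so far, which can only grow as the path extends, so it is a valid bound):

my $best_dist = 9**9**9;
my @best;

# Partial distance of the current @path: count of adjacent clumps.
sub partial_distance {
    my $clumps = 0;
    $path[$_] eq $path[$_ - 1] and $clumps++ for 1 .. $#path;
    return $clumps;
}

sub bnb {
    # Bound: this subtree cannot beat the best complete solution.
    return if partial_distance() >= $best_dist;
    my $done = 1;
    for my $set (@sets) {
        if (@$set) {
            $done = 0;
            push @path, shift @$set;
            bnb();
            unshift @$set, pop @path;
        }
    }
    if ($done) {
        my $d = partial_distance();
        ($best_dist, @best) = ($d, @path) if $d < $best_dist;
    }
}

Calling bnb() instead of branch() leaves one minimally clumped arrangement in @best.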