NERDVANA has asked for the wisdom of the Perl Monks concerning the following question:

Brain teaser alert! This is more tricky than you would expect.

I'm looking for an algorithm that can take lists of strings that are in a user-declared logical order, and merge those lists in the way that preserves the most of those logical orderings.

Example:

use Test2::V0; use v5.40; my @list1= qw( K B I S Y A Q ); my @list2= qw( B S N A ); my @list3= qw( Y N Q ); my $combined= fancy_algorithm(\@list1, \@list2, \@list3); is( $combined, [qw( K B I S Y N A Q )] );
For bonus points, it should return a stable output when there are ambiguities (such as if there is no list which establishes the order between two elements), and if the input disagrees with itself it should select the ordering that occurs more frequently.
use Test2::V0; use v5.40; my @lists= ( [qw( X P Y )], [qw( X B Y N )], [qw( P B N )], [qw( X B P N )], [qw( X B P Y )], [qw( A Z )], [qw( A K L )], ); my $combined= fancy_algorithm(@lists); is( $combined, [qw( X B P Y N A Z K L )] );

Why is it relevant? because on more than one occasion I've wanted to export user data with the columns in the "natural order", but I'm combining multiple independent data sets and for each dataset the user has chosen the column names according to some ordering that makes sense to them but which the system has no knowledge of. I have a solution that "mostly works", but I keep finding edge cases where it breaks. I'm curious if there is a complete solution to the problem.

Edit: ...and hopefully more efficient than just running every permutation of output looking for the one with the highest score. Sometimes there are a lot of columns and that could be bad.

Replies are listed 'Best First'.
Re: Merge lists of ordered strings into canonical ordering
by tybalt89 (Monsignor) on Sep 19, 2024 at 21:52 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11161855 use warnings; use List::AllUtils qw( min_by nsort_by ); $SIG{__WARN__} = sub { die @_ }; my @list1= qw( K B I S Y A Q ); my @list2= qw( B S N A ); my @list3= qw( Y N Q ); my $combined= fancy_algorithm(\@list1, \@list2, \@list3); use Data::Dump 'dd'; dd 'combined', $combined; my @lists= ( [qw( X P Y )], [qw( X B Y N )], [qw( P B N )], [qw( X B P N )], [qw( X B P Y )], [qw( A Z )], [qw( A K L )], ); $combined= fancy_algorithm(@lists); use Data::Dump 'dd'; dd 'combined', $combined; sub fancy_algorithm { my (%names, @order); @names{ @$_ } = () for my @args = @_; use Data::Dump 'dd'; dd 'input', \@args; while( @args ) { my %names; my $see = 1; $names{$_} += $see++ for map @$_, @args; my @names = nsort_by { $names{$_} } sort +keys %names; my %score; for ( @args ) { my ($head, @rest) = @$_; $score{$head} -= 1e9; $score{$_} //= 0 for @rest; } my $pick = min_by { $score{$_} } @names; push @order, $pick; $_ = [ grep $pick ne $_, @$_ ] for @args; @args = grep @$_, @args; } return \@order; }

    Outputs:

    ( "input", [ ["K", "B", "I", "S", "Y", "A", "Q"], ["B", "S", "N", "A"], ["Y", "N", "Q"], ], ) ("combined", ["K", "B", "I", "S", "Y", "N", "A", "Q"]) ( "input", [ ["X", "P", "Y"], ["X", "B", "Y", "N"], ["P", "B", "N"], ["X", "B", "P", "N"], ["X", "B", "P", "Y"], ["A", "Z"], ["A", "K", "L"], ], ) ("combined", ["X", "B", "P", "Y", "N", "A", "Z", "K", "L"])
      Very interesting! It took me a while to figure out what you were doing here - basically just assuming that the next in the sequence must be found at the start of the remaining options, choosing one that is either the most occurrences at starts of remaining lists or first seen, removing it from the options, then repeat? It seems like you can simplify that a bit though. (I'm not sure what the 1e9 was about unless that was from an earlier attempt?)
      sub fancy_algorithm { my @args= @_; p @args; my @order; while (@args) { my ($pick, %score, %seen_order); my $see = 1; $seen_order{$_} += $see++ for map @$_, @args; for (@args) { my $n= --$score{$_->[0]}; $pick= $_->[0] if !defined $pick or $n < $score{$pick} or $n == $score{$pick} && $seen_order{$_->[0]} < $seen_ord +er{$pick}; } push @order, $pick; $_ = [ grep $pick ne $_, @$_ ] for @args; @args = grep @$_, @args; } return \@order; }

      It seems to work in every case I've tried, though it gives less ideal results with:

      my @lists= ( [qw( S T )], [qw( S U )], [qw( Y Z )], [qw( X Y )], [qw( W X )], [qw( V W )], [qw( U V )], ); # S U V W X Y T Z
Re: Merge lists of ordered strings into canonical ordering
by tybalt89 (Monsignor) on Sep 20, 2024 at 16:24 UTC

    Sometimes algorithms are shorter as strings instead of arrays.
    Not quite the same answer as yours, but I'd claim it's good enough :)

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11161855 use warnings; use List::AllUtils qw( max_by ); $SIG{__WARN__} = sub { die @_ }; my @list1= qw( K B I S Y A Q ); my @list2= qw( B S N A ); my @list3= qw( Y N Q ); my $combined= fancy_algorithm(\@list1, \@list2, \@list3); use Data::Dump 'dd'; dd 'combined', $combined; my @lists= ( [qw( X P Y )], [qw( X B Y N )], [qw( P B N )], [qw( X B P N )], [qw( X B P Y )], [qw( A Z )], [qw( A K L )], ); $combined= fancy_algorithm(@lists); use Data::Dump 'dd'; dd 'combined', $combined; @lists= ( [qw( S T )], [qw( S U )], [qw( Y Z )], [qw( X Y )], [qw( W X )], [qw( V W )], [qw( U V )], ); $combined= fancy_algorithm(@lists); use Data::Dump 'dd'; dd 'combined', $combined; sub fancy_algorithm { my $rule = join "\n", map "@$_", @_; my @order; print "$rule\n"; # FIXME only for testing while( $rule =~ /\S/ ) { my ($head) = max_by { my $n = () = $rule =~ /\b$_\b/g } grep { $rule !~ /\w +$_\b/ } my @front = sort $rule =~ /^ *(\w+) +\b/gm; $head //= shift @front; # for breaking cycles push @order, $head; $rule =~ s/\b$head\b//g; } return \@order; }

    Outputs:

    K B I S Y A Q B S N A Y N Q ("combined", ["K", "B", "I", "S", "Y", "N", "A", "Q"]) X P Y X B Y N P B N X B P N X B P Y A Z A K L ("combined", ["X", "A", "K", "L", "Z", "B", "P", "Y", "N"]) S T S U Y Z X Y W X V W U V ("combined", ["S", "U" .. "Y", "T", "Z"])
Re: Merge lists of ordered strings into canonical ordering
by hv (Prior) on Sep 19, 2024 at 18:46 UTC

    I think a graph approach makes sense, as suggested by Anonymous Monk. I think you can do this using a directed Graph with countedged to record the fact that an ordering has been seen multiple times, then break loops by reducing the weight of all edges in the loop by the minimum of them and removing any edge that reaches zero. It's not immediately obvious to me what should happen when there are multiple intersecting loops - you'd probably need to experiment to find a viable approach for that case.

Re: Merge lists of ordered strings into canonical ordering
by Corion (Patriarch) on Sep 19, 2024 at 18:18 UTC

    I would look at Algorithm::Diff or simply sequentially merging the columns into the master result until all columns have been merged into their respective places.

      The problem with this approach is it can't preserve the unknown status of the orderings, to be resolved by later orderings. In the first example, "S Y A" and "S N A" should combine in a way that expresses a very low preference for Y coming before N, and then that low preference should get overridden by the strong preference for Y coming before N by the 3rd rule. (and it would have made a better example if I put N before Y in the 3rd rule because then the strong preference would be contrary to the default order)
Re: Merge lists of ordered strings into canonical ordering
by Anonymous Monk on Sep 19, 2024 at 18:15 UTC
      Topological sort is basically it, but with the catch that sometimes the user contradicts themself.
Re: Merge lists of ordered strings into canonical ordering
by LanX (Saint) on Sep 19, 2024 at 21:00 UTC
    I missed your second test, not sure if it's because --

    • I don't understand your criteria "select the ordering that occurs more frequently"
    • avoiding ambiguity can't be tested
    • or my code is faulty

    Could you please elaborate where and why your $combined2 is better?

    (UPDATE: If there is a quantifiable reason why "X" should be ordered before "A", I'm sure this can be added to the sort function, see cmp_more_frequently($a,$b) )

    The basic idea is to build a sort function which sorts by the combined less-then relation stored in a HoH %cmp.

    Contradictory orders are calculated by increments and decrements.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    updates

  • renamed %lt to %cmp
  • DISCLAIMER: This is a hack, and can be done better and more elegantly, and I'm sure it lacks proper testing.
  • OK there is definitely a bug, looking into it
      So I fixed the bug in the $osort (precedence) and my second "non-ambiguity" test still "fails", but is only diverging for "Z" from yours.

      And I think my solution is better because Z has only one relation. (YMMV)

      use v5.12; use warnings; use Data::Dump qw/pp dd ddx/; use Test::More; my @list1= qw( K B I S Y A Q ); my @list2= qw( B S N A ); my @list3= qw( Y N Q ); my $combined = combined_order(\@list1, \@list2, \@list3); my $expected = [qw( K B I S Y N A Q )]; is_deeply( $combined, $expected ) or diag pp { got => $combined, exp => $expected }; my @lists2= ( [qw( X P Y )], [qw( X B Y N )], [qw( P B N )], [qw( X B P N )], [qw( X B P Y )], [qw( A Z )], [qw( A K L )], ); # my $dbg_sort = gen_sort_by_orders(@lists2); # ($a,$b) = qw/B X/; # is(signum(&$dbg_sort), 1, "$a cmp $b"); # ($a,$b) = qw/X B/; # is(signum(&$dbg_sort), -1, "$a cmp $b"); my $combined2 = combined_order(@lists2); my $expected2 = [qw( X B P Y N A Z K L )]; is_deeply( $combined2, $expected2 ) or diag pp { got => $combined2, exp => $expected2 }; sub combined_order { my @orders = @_; my @total = map {@$_} @orders; my %uniq; @uniq{@total} =(); @total = keys %uniq; my $osort = gen_sort_by_orders (@orders); #ddx @total; return [ sort {&$osort} @total ]; } sub signum { my ($val) = @_; return 0 unless $val; # 0 or undef return $val / abs($val); } sub gen_sort_by_orders { my @orders = @_; my %cmp; # relation matrix my %count; for my $order (@orders) { my @list = @$order; while (my $before = shift @list) { for my $after (@list) { # comparing lesser < greater is enough bc of symmetry $before lt $after ? $cmp{$before}{$after}-- : $cmp{$after}{$before}++; # we count the relations for second criteria $count{$before}++; $count{$after}++; } } } ddx { "Matrix:" => \%cmp, "Count:" => \%count }; # debu +g my $osort = sub { return ( $a lt $b ? ( $cmp{$a}{$b} // 0 ) : - ( $cmp{$b}{$a} // 0 ) ) || ( $count{$b} - $count{$a} ) # more relations win || $a cmp $b # default anti ambiguity }; return $osort; } done_testing;

      perl /home/lanx/perl/multisort.pl # multisort.pl:83: { # "Count:" => { A => 9, B => 9, I => 6, K => 6, N => 5, Q => 8, S = +> 9, Y => 8 }, # "Matrix:" => { # A => { B => 2, I => 1, K => 1, N => 1, Q => -1, S = +> 2, Y => 1 }, # B => { I => -1, K => 1, N => -1, Q => -1, S => -2, +Y => -1 }, # I => { K => 1, Q => -1, S => -1, Y => -1 }, # K => { Q => -1, S => -1, Y => -1 }, # N => { Q => -1, S => 1, Y => 1 }, # Q => { S => 1, Y => 2 }, # S => { Y => -1 }, # }, # } ok 1 # multisort.pl:83: { # "Count:" => { A => 3, B => 11, K => 2, L => 2, N => 8, P => 10, X + => 11, Y => 8, Z => 1 }, # "Matrix:" => { # A => { K => -1, L => -1, Z => -1 }, # B => { N => -3, P => -1, X => 3, Y => -2 }, # K => { L => -1 }, # N => { P => 2, X => 2, Y => 1 }, # P => { X => 3, Y => -2 }, # X => { Y => -3 }, # }, # } not ok 2 # Failed test at /home/lanx/perl/multisort.pl line 39. # Structures begin differing at: # $got->[6] = 'K' # $expected->[6] = 'Z' # { # exp => ["X", "B", "P", "Y", "N", "A", "Z", "K", "L"], # got => ["X", "B", "P", "Y", "N", "A", "K", "L", "Z"], # } 1..2 # Looks like you failed 1 test of 2.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Thanks! This is pretty similar to what I had, but yours works :-)

      Could you please elaborate where and why your $combined2 is better?

      (UPDATE: If there is a quantifiable reason why "X" should be ordered before "A", I'm sure this can be added to the sort function, see cmp_more_frequently($a,$b) )

      It wasn't really a requirement, other than that every time you give the algorithm those inputs in that order, you get the same output. In other words I don't want it to depend on hash randomization, so I think original input order should be used as a tie-breaker. When I built the example, I put 'A' after everything else since there was no relation to the previous rules and the rule with 'A' came later. Same with 'Z' vs. 'K'.

Re: Merge lists of ordered strings into canonical ordering
by ikegami (Patriarch) on Sep 20, 2024 at 16:32 UTC

    Nevermind. Second example contradicts this.


    You're asking for an order-preserving duplicate remover.

    my %seen; my @combined = grep !$seen{ $_ }++, @list1, @list2, @list3;
    use List::Util qw( uniq ); my @combined = uniq @list1, @list2, @list3;