anonymized user 468275 has asked for the wisdom of the Perl Monks concerning the following question:

I am running 64 bit active perl v5.22.1 on a windows 10 dual AMD box. I am writing a generic database load routine that reads a CSV per table and generates and executes SQL into Postgres. The CSV's have table declarations first followed by a token __COLUMNS__ followed by the data to insert in the table. One of the table declarations I configure takes the form: "FK", local column, foreign table name, foreign PK. So I have to sort the CSVs so that wherever such a declaration exists, the foreign table is created before the table in which the FK declaration is made. For simplicity, I load all the CSVs first into a hash of array of array, table name is hash key with value being the CSV data for that table. Some issues and points of interest arise from this.

1) Given the basics of the sorting (structure is in $cvsQ):

$obj->_getset('csvQ', $csvQ); for my $csvq ( sort _DBsort4create keys %$csvQ ) { $obj->_getset('ar', $csvq); $obj->_arr2db; # create and populate table }
...the table names appear in pairs $a and $b and the sort routine needs access to $obj to compare the CSV data for the two tables. When I tried passing parameters to the sort routine or putting $obj-> in front of the sort routine call, I got a syntax error. But when I removed the $obj-> from the call, this corrected the syntax error. But strangely, the code below which shifts the first parameter magically DWIMmed by getting the right object...
sub _DBsort4create { my $obj = shift; my $csvref = $obj->_getset('csvQ'); die Dumper( $csvref->{$a} ); # next steps: 1) check for comparison already done # 2) if so just return old result # 3) get FK declaration, store implied # comparison unless contradicts history # if contradiction, give message that database design cannot be physically implemented }
the above code therefore displayed a dump of the AoA for the first key about to be sorted in the HoAoA. So how did it acquire $obj as a parameter given that the sort routine is called bare? There is a CPAN module that provides a sort routine for objects but that isn't the same thing at all. I could pass the structure via a package variable but then I haved two evils to choose from: global variables versus undocumented features.

2) If $a does not reference $b or vice versa, I can return 0 otherwise I can optimise by retrieving a previous comparison of the same pair before delving into the structure if already compared. But the problem is that there could be circular references that cannot be pairwise detected which one would expect to cause sort to go into an infinite loop. So the second issue is: given the list of established pairwise comparisons, how to test for an implied contradiction with the current iteration of sort e.g. M > N and N > P but one of the current pair declares P > M and so would cause an infinite loop if the sort routine were allowed to continue. And of course this suggests I need to do something other than pairwise sort in the first place.

The best idea I have is to build the history of comparisons as a list of groups (hashes) of comparisons instead of a list of comparisons, so that elements not explicitly sorted are left unsorted in each hash and the criteria for predicting an infinite loop becomes an attempt to put the same table in more than one group. Or perhaps someone has a less esoteric way to perform the test?

Many thanks!

Simon

One world, one people

  • Comment on Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
  • Select or Download Code

Replies are listed 'Best First'.
Re: Sort mechanics problems with objects and potentially contradicting comparisons (can't cause infinite loop)
by tye (Sage) on Jun 01, 2016 at 16:51 UTC
    which one would expect to cause sort to go into an infinite loop

    Not this "one". And the documentation is pretty clear that this can't happen:

    In 5.7, the quicksort implementation was replaced with a stable mergesort algorithm whose worst-case behavior is O(NlogN).

    An infinite loops is a way worse case than O(NlogN).

    And, no, I'm pretty sure that the following bit in the documentation doesn't trump that part:

    If it returns inconsistent results (sometimes saying $x[1] is less than $x[2] and sometimes saying the opposite, for example) the results are not well-defined.

    "Not well-defined" means that you might not be able to predict the order in which the items are returned. All of the items will be returned, one alias each, and in under O(NlogN) "time". It is just their precise order that is not well-defined.

    - tye        

Re: Sort mechanics problems ...
by haukex (Archbishop) on Jun 01, 2016 at 18:34 UTC
    One of the table declarations I configure takes the form: "FK", local column, foreign table name, foreign PK. So I have to sort the CSVs so that wherever such a declaration exists, the foreign table is created before the table in which the FK declaration is made.

    Why not use a Dependency graph?

    use Graph; my $g = Graph->new(directed => 1); $g->add_edge(split /:/) for qw/ a:d a:b c:b d:c /; my @order = $g->topological_sort; print join(", ", @order), "\n"; __END__ a, d, c, b

    Hope this helps,
    -- Hauke D

      You are indeed correct that the classic solution to the problem is a graph and the type of sort is topological. However, the perl code is not compelled to support every step of the theory. It turns out to be much simpler in practice with perl sort:
      if $a is dependent on $b, return -1 if the reverse, return 1 otherwise return 0
      this lets $a and $b remain unsorted relative to each other where there is no dependency.

      One world, one people

        ...But, as you've discovered, it's not actually that simple. Another issue is that you'll need a stable sort algorithm for this to work (see the sort pragma). We're trying to tell you that sort is really not the right tool for this job.
      This was definitely the most helpful answer. Thanks for taking the trouble!

      One world, one people

Re: Sort mechanics problems ...
by haukex (Archbishop) on Jun 01, 2016 at 15:52 UTC

    The syntax for custom sort routines is sometimes tricky to get right - towards the end of the sort page there are some examples of the syntax. It sounds like your problem might be fixed with the syntax sort {$obj->_DBsort4create(...args...)} keys %$csvQ, but I'm not sure as you haven't provided runnable example code...

    As for your second question, I haven't yet wrapped my head around the potential problem you're describing (no sample code that demonstrates the infinite loop?), but if you're asking about optimizing the sort, then maybe a Schwartzian transform will help?

    Regards,
    -- Hauke D

      Yes I agree with the idea of enclosing the sort call in a block, especially given that Perl documents this approach and it appears all the mechanics are fine, except the potential for infinite loop needing to be caught.

      The problem with sample code that is already wrong is that it night have a different problem from the same theoretical weakness. So the issue of circular dependencies should be resolved in theory before writing any more code.

      The optimisation part is best done using an orcish manoeuvre than a schwartzian transform because it is a question of not doing the same processing twice although the sort algorithm will call the function many times with the same table appearing in $a or $b for successive iterations. Except that this time, apart from storing past processing results of individual elements to prevent reiterating them I need also to store past comparisons to check for circular dependencies.

      One world, one people

        The problem with sample code that is already wrong is that it night have a different problem from the same theoretical weakness. So the issue of circular dependencies should be resolved in theory before writing any more code.

        Even if we discuss theory, you still need to communicate the potential problem somehow. So I disagree that code isn't useful at this point - it should be possible to construct a piece of sample code that demonstrates the infinite loop issue that you are worried about, even if it takes one or two iterations.

        Regards,
        -- Hauke D

Re: Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
by ikegami (Patriarch) on Jun 01, 2016 at 16:50 UTC

    But strangely, the code below which shifts the first parameter magically DWIMmed by getting the right object.

    I don't believe you.

    perl -e' use strict; use warnings; use Data::Dumper qw( Dumper ); sub _DBsort4create { my $obj = shift; print(Dumper($obj)); 0 } my $csvQ = { x => 123, y => 456 }; for my $csvq ( sort _DBsort4create keys %$csvQ ) { } ' $VAR1 = undef;

    If it does happen to return the right object, you are modifying a stack you shouldn't be modifying, which should lead to dire repercussions.

      In your example no object has been blessed

      One world, one people

        Do you really think that adding my $o = bless({}); is going to make a difference? Why? Did you try it?

      You're right, ik, the code will not work as the OP claims. @_ does not get set when _DBsort4create is called by sort; it's left holding the argument list of the sub that called sort.

      sub compare { my $x = shift; print "\$a = $a, \$b = $b, \$x = $x, \@_ = @_\n"; $a cmp $b; } sub test { sort compare 'z','a','b' } my @z = test('foo', 'bar', 'baz'); print "\@z = @z\n"; output: $a = z, $b = a, $x = foo, @_ = bar baz $a = a, $b = b, $x = bar, @_ = baz $a = b, $b = z, $x = baz, @_ = @z = a b z
Re: Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
by Anonymous Monk on Jun 01, 2016 at 15:47 UTC
    sort { _compare($obj, $a, $b) } @stuff
      That's a way to avoid the undocumented feature, but it is also nonstandard. I'll experiment with a variation though:
      sort { $obj->_compare }
      which also passes $obj as first parameter and see if I still get $a and $b for free, which IS documented magic. Many thanks for the suggestion.

      One world, one people

        and see if I still get $a and $b for free

        You can get at them:

        use warnings; use strict; { package Bar; my $obj = Foo->new; print join ", ", sort {$obj->mysort($a,$b,"bar")} qw/x a o/; } { package Foo; sub new { bless {}, shift } use Data::Dump 'pp'; sub mysort { pp \@_; $_[0] cmp $_[1]; } } __END__ [bless({}, "Foo"), "x", "a", "bar"] [bless({}, "Foo"), "x", "o", "bar"] [bless({}, "Foo"), "o", "a", "bar"] x, o, a

        Updated example: Previously I had mysort reaching into the main package via $::a, which is obviously not the best solution, now I'm passing $a and $b as parameters. (I see the AM made the same suggestion.) Update 2: There are other solutions possible as well.

        Hope this helps,
        -- Hauke D

        That makes things far more complicated. Because _compare is likely to be in a package other than the sort, _compare would have to use ${ caller.'::a' } and ${ caller.'::b' } instead of $_[0] and $_[1]. Best to pass $a and $b are arguments.

        If you pass $a and $b explicitly, you might be able to reuse that sub outside of sort. The "efficiency" of the implicit $a and $b is not really a big deal.

        $obj->_compare($a,$b)
Re: Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
by Anonymous Monk on Jun 01, 2016 at 15:51 UTC
    For part 2, it sounds like you're trying to do some kind of topological sort, but I'm not sure of the details.
      There are several modules available, but I haven't tried them. It may actually be easier to roll your own than to figure out how the modules work. topological sort
        I am beginning to agree with you on that. Although I need both a sorted list of groups and a hash for looking up what is in which group to do the homegrown thing, it is not so big a stretch to go from the orcish manoeuvre to a group sort on the same basis that handles the topological versus circular dependency scenario safely.

        One world, one people

Re: Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
by anonymized user 468275 (Curate) on Jun 02, 2016 at 08:03 UTC
    So I decided to let the database handle any circular dependencies (it will fail to define one of the tables and I have to then roll back everything) but apart from that I have a working topological sort. Thanks to all who helped out.
    my $csvQ = {}; for my $csv ($obj->_globcsv) { my $base = basename $csv; $base =~ s/\.txt$//; $base = lc($base); $obj->_csv2arr($csv); $csvQ->{$base}{table} = $obj->_getset('ar'); } $obj->_getset('csvQ', $csvQ); for my $table ( sort _DBsort4create keys %$csvQ ) { $obj->_getset('ar', $csvQ->{$table}{table}); $obj->_arr2db; # process the csv into the db } sub _DBsort4create { my $obj = shift; my $csvQ = $obj->_getset('csvQ'); for my $ab ($a, $b) { unless(exists($csvQ->{$ab}{ancestor})) { $csvQ->{$ab}{ancestor} = {}; ROWDBS4C: for (my $row=1; $row <= $#{$csvQ->{$ab}{table}}; $ro +w++) { # avoid first row, which is a header not csv data ($csvQ->{$ab}{table}[$row][0] eq '__COLUMNS__') and last ROWDBS4C; if ($csvQ->{$ab}{table}[$row][0] eq 'FK') { $csvQ->{$ab}{ancestor}{$csvQ->{$ab}{table}[$row][2]}=' +'; } } } } if (exists($csvQ->{$a}{ancestor}{$b})) { return 1; } elsif (exists($csvQ->{$b}{ancestor}{$a})) { return -1; } return 0; }
    Sorts by ancestral relationship of FKs. Minor optimisation similar to orcish manoevre: only search a csv array for FK entries once.

    One world, one people

Re: Sort mechanics problems with objects and potentially contradicting comparisons (would cause infinite loop)
by FloydATC (Deacon) on Jun 02, 2016 at 10:56 UTC

    I would say that the task of detecting circular references in a data structure does not involve sorting at all. Simply walk the data structure/hierarchy/whatever while keeping track of the path you followed to get there.

    Here's some untested code to show what I mean:

    sub is_circular { my $obj = shift; # Object to be traversed my $path = shift || []; # Check of $obj is in the $path foreach my $ancestor (@{$path}) { return 1 if $ancestor eq $obj; } # Check the children recursively, depth first # (My imaginary $obj conveniently has # a method for getting an array of its children) foreach my $child ($obj->children) { return 1 if is_circular($child, [ @{$path}, $obj ] ); } return 0; }

    This can be done recursively or in a linear fashion if this suits you better. Modules already exist but writing your own code would probably be faster than trying to fit an existing module with your particular data.

    -- FloydATC

    Time flies when you don't know what you're doing

      You haven't realised that any traversal would be infinite for the failing cases for the same reasons. Repeat visits to a node in the structure tell you nothing about whether the traversal is finite. And counting those is not really a direct approach. A sort is a kind of traversal too.

      One world, one people

        This problem is known as cycle detection, and there are a number of different solutions, depending on your requirements.

        Repeat visits to a node in the structure tell you nothing about whether the traversal is finite.

        I'm not sure what you mean by that. Repeat visits to a node tell you that you're in a cycle, and you'll keep going around forever unless you break out (as Floyd's code does).

        Most sort algorithms won't loop infinitely if the comparison function is inconsistent. They'll just return the list in a non-meaningful order.