Basically I need to create a list of values in the correct order so that a "dependent" is always loaded after the parent otherwise the load simply will not work.
It took me a while to get my head around the issue, but I think it's accurate to rephrase it like this:
In order for the data set to be loaded correctly as a whole, records containing top-level parents must be loaded before those containing intermediate parents. In other words, whenever a given value is first seen in the "parent" column, insertion of that record must be deferred until either:
- it is established that this value never occurs as the "child" in another record, or
- the record containing this value as "child" has already been stored to the database.
(Meanwhile, records having the same value as both parent and child may be loaded at any time in the sequence.)
Supposing that's right, here's a demonstration on a small sample data set, using a hash structure to keep track of parent and child relations for each field value, and a recursive function to print out the records in the order required. (Recursion is not needed to load the hash.)
All this does is sort the input lines into the desired order (if a value serves as both child and parent, the record that has the value as child always comes first). This works well for loading the database, because you'll probably want to use whatever native loader is available for the particular database server you're using (inserts via DBI are very slow in comparison).
#!/usr/bin/perl
use strict;
use warnings;
my %node;
while (<DATA>) {
my ( $c, $p ) = split;
if ( $c eq $p ) { # these are easy, so finish them first
print;
next;
}
if ( exists( $node{$c}{child_of} )) {
warn "$.: bad record: $c is child of both $p and $node{$c}{chi
+ld_of}\n";
next;
}
$node{$c}{child_of} = $p;
$node{$p}{parent_of}{$c} = undef;
}
# begin the sorted output by looping over values that do not have pare
+nts:
for my $parent ( grep { !exists( $node{$_}{child_of} ) } keys %node )
+{
my $children = $node{$parent}{parent_of}; # ref to hash of child
+values
trace_down( $children, \%node );
}
sub trace_down
{
my ( $kids, $tree ) = @_;
for my $kid ( keys %$kids ) {
print "$kid $$tree{$kid}{child_of}\n";
if ( exists( $$tree{$kid}{parent_of} )) {
trace_down( $$tree{$kid}{parent_of}, $tree );
}
}
}
__DATA__
n1 n2
n3 n2
n4 n1
n5 n5
n6 n4
n7 n7
n8 n9
n10 n2
n11 n6
n2 n12
n7 n6
n8 n4
Note that I took the liberty of enforcing a "one-parent-only" constraint (the last line of DATA causes a warning, and is ignored). But I wasn't clear about the status of values that occur as both parent and child in a single record. (Should the next-to-last line of DATA cause a warning as well?) The special treatment of the "child eq parent" records might be problematic...
(Update: moved the line in the main "while" loop that stores the "parent_of" relation, so that it happens after the "one-parent-only" condition has been passed; if it happens before that check, it puts a redundant entry in the hash structure.) |