SlackBladder has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks, I have a problem which I have been mulling over for a couple of weeks now. The problem is to do with the correct order for loading data into an RDBMS from a data set extracted from another system...
1, I have a dataset (extracted from MSSQL Server) which contains two columns (only ever two columns) but several thousand records.
2, The data in the left column can be thought of as the "child", the data in the right can be thought of as the "parent"
3, If the two columns match then this piece of data is both parent and child and can be loaded at any time.
4, If the two are different then the child has a parent.
5, The parent can have many children, which means that this parent needs to be loaded before its children.
6, It's possible for the parent to also be a child (child, parent, grandparent), which means the grandparent needs to be loaded before the parent, which is loaded before the child.
It's a conundrum. Basically I need to create a list of values in the correct order so that a "dependent" is always loaded after the parent otherwise the load simply will not work. I was hoping that there would be a Perl module which would fit my requirements, or some pretty smart work with arrays and/or hashes. Any suggestions guys? SlackyB

Re: Perl modules or standard tools for searching hierachical data
by kyle (Abbot) on Mar 12, 2007 at 17:55 UTC

    If you have enough memory, you could build up a dependency tree. Otherwise, I think you have to make multiple passes.

    For each entry:

    1. If it's a root node (columns match), load it now.
    2. If it's not a root node, check to see if its parent has been loaded yet.
    3. If its parent is there, load it.
    4. If the parent is not there, skip over it.
    5. If you've loaded this entry, remove it from your list before going on to the next entry.

    Do that loop until the list of entries is empty. You can expect to loop over the list once for every generation.
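The multi-pass loop described above could be sketched roughly as follows. This is only an illustration: `order_by_passes` is a made-up name, the "load" step is just a push onto an output list, and it assumes every parent value also shows up as a child in some record (possibly its own root record). If a pass loads nothing, the sketch gives up rather than looping forever.

```perl
use strict;
use warnings;

# Hypothetical helper: takes [child, parent] pairs and returns them in an
# order where each record comes after the record that loads its parent.
sub order_by_passes {
    my @pending = @_;
    my ( %loaded, @ordered );

    while (@pending) {
        my @skipped;
        for my $rec (@pending) {
            my ( $child, $parent ) = @$rec;
            if ( $child eq $parent || $loaded{$parent} ) {
                push @ordered, $rec;     # root node, or parent already loaded
                $loaded{$child} = 1;
            }
            else {
                push @skipped, $rec;     # parent not there yet: try next pass
            }
        }

        # A pass that loads nothing means the remaining parents never arrive.
        die "cannot resolve: "
            . join( ", ", map { "$_->[0]|$_->[1]" } @skipped ) . "\n"
            if @skipped == @pending;

        @pending = @skipped;             # loaded entries drop off the list
    }
    return @ordered;
}

my @sorted = order_by_passes( [ 'c', 'b' ], [ 'b', 'a' ], [ 'a', 'a' ] );
print join( ' ', map { "$_->[0]|$_->[1]" } @sorted ), "\n";   # a|a b|a c|b
```

With the records deliberately given in the worst order, this takes three passes, one per generation.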

Re: Perl modules or standard tools for searching hierachical data
by agianni (Hermit) on Mar 12, 2007 at 18:26 UTC

    Tree should provide the tree management functionality (building and traversing); you'll simply need to write the code to build your tree. If your records are in the thousands and not tens or hundreds of thousands, you should be fine from a memory perspective, although the only way to really know is to try.

    Alternately, since it sounds like you're dealing with referential integrity issues, you might consider the non-Perl solution of turning off the integrity constraints before you load and just have at it. If you load anything incorrectly, you won't be able to turn the constraints back on. Since you're copying this from another data source, which I'm assuming is itself consistent, this should work fine. It's an inelegant solution, but it should be pretty easy.

Re: Perl modules or standard tools for searching hierachical data
by graff (Chancellor) on Mar 13, 2007 at 04:17 UTC
    Basically I need to create a list of values in the correct order so that a "dependent" is always loaded after the parent otherwise the load simply will not work.

    It took me a while to get my head around the issue, but I think it's accurate to rephrase it like this:

    In order for the data set to be loaded correctly as a whole, records containing top-level parents must be loaded before those containing intermediate parents. In other words, whenever a given value is first seen in the "parent" column, insertion of that record must be deferred until either:

    1. it is established that this value never occurs as the "child" in another record, or
    2. the record containing this value as "child" has already been stored to the database.
    (Meanwhile, records having the same value as both parent and child may be loaded at any time in the sequence.)

    Supposing that's right, here's a demonstration on a small sample data set, using a hash structure to keep track of parent and child relations for each field value, and a recursive function to print out the records in the order required. (Recursion is not needed to load the hash.)

    All this does is sort the input lines into the desired order (if a value serves as both child and parent, the record that has the value as child always comes first). This works well for loading the database, because you'll probably want to use whatever native loader is available for the particular database server you're using (inserts via DBI are very slow in comparison).

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %node;
        while (<DATA>) {
            my ( $c, $p ) = split;
            if ( $c eq $p ) {  # these are easy, so finish them first
                print;
                next;
            }
            if ( exists( $node{$c}{child_of} )) {
                warn "$.: bad record: $c is child of both $p and $node{$c}{child_of}\n";
                next;
            }
            $node{$c}{child_of} = $p;
            $node{$p}{parent_of}{$c} = undef;
        }

        # begin the sorted output by looping over values that do not have parents:

        for my $parent ( grep { !exists( $node{$_}{child_of} ) } keys %node ) {
            my $children = $node{$parent}{parent_of};  # ref to hash of child values
            trace_down( $children, \%node );
        }

        sub trace_down
        {
            my ( $kids, $tree ) = @_;
            for my $kid ( keys %$kids ) {
                print "$kid $$tree{$kid}{child_of}\n";
                if ( exists( $$tree{$kid}{parent_of} )) {
                    trace_down( $$tree{$kid}{parent_of}, $tree );
                }
            }
        }
        __DATA__
        n1 n2
        n3 n2
        n4 n1
        n5 n5
        n6 n4
        n7 n7
        n8 n9
        n10 n2
        n11 n6
        n2 n12
        n7 n6
        n8 n4

    Note that I took the liberty of enforcing a "one-parent-only" constraint (the last line of DATA causes a warning, and is ignored). But I wasn't clear about the status of values that occur as both parent and child in a single record. (Should the next-to-last line of DATA cause a warning as well?) The special treatment of the "child eq parent" records might be problematic...

    (Update: moved the line in the main "while" loop that stores the "parent_of" relation, so that it happens after the "one-parent-only" condition has been passed; if it happens before that check, it puts a redundant entry in the hash structure.)

      Now I know I am a PERL "novice" !!
      Thank you for the "wisdom" graff.
      I have updated my code so that the original stuff I had (data extraction from SQL Server using "Win32::SqlServer") can be used by the code you kindly supplied. I think that it's an excellent base, but after checking the result set I think that perhaps I did not do a good job of describing some "irregularities" in the data set.
      For example, the text "A1-910-ES-002B" can be found as a parent to "A1-910-ES-002B-16" (the actual record is)
      A1-910-ES-002B-16|A1-910-ES-002B (pipe symbol is the delimiter).
      Unfortunately, the text "A1-910-ES-002B" is also found in the child number "A1-910-ES-002B-17", which has a parent of "A1-910-ES-002B" (full record is)
      A1-910-ES-002B-17|A1-910-ES-002B
      I have a sneaking suspicion that the grep command is matching not only the text in isolation but any line that has that text in it.
        I have a sneaking suspicion that the grep command is matching not only the text in isolation but any line that has that text in it.

        If there's a problem, I don't think it involves the grep operation in the code I suggested. In the "for" loop over top-level parents, the grep does not involve a regex match, and is not susceptible to "false-alarm" substring matches. It is only returning the keys for those elements in the %node hash that are not the children of other nodes. Hash key lookups are always based on exact matches -- "abc" does not match "abcd" as a hash key.
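To illustrate the point about exact matching, here is a small sketch using the values from the post above. The `%node` hash and the `@values` list are made up for the demonstration. Only the regex-based grep picks up the longer strings.

```perl
use strict;
use warnings;

my %node = ( 'A1-910-ES-002B' => 1 );

# Hash lookups compare whole keys, so the longer value is a different key.
my $short_found = exists $node{'A1-910-ES-002B'}    ? 1 : 0;   # 1
my $long_found  = exists $node{'A1-910-ES-002B-16'} ? 1 : 0;   # 0

# A substring "false alarm" needs a regex (or index) somewhere in the code.
my @values = ( 'A1-910-ES-002B', 'A1-910-ES-002B-16', 'A1-910-ES-002B-17' );
my @regex_hits = grep { /A1-910-ES-002B/ } @values;           # all three
my @exact_hits = grep { $_ eq 'A1-910-ES-002B' } @values;     # just one

printf "hash: %d %d, regex: %d, exact: %d\n",
    $short_found, $long_found, scalar @regex_hits, scalar @exact_hits;
```

So if unrelated records are showing up together, it would be worth looking for a pattern match elsewhere in the surrounding code, rather than at the hash lookups.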

        The two sample records you cited involve two child values having the same parent value, so if I understand what you've said so far, their ordering relative to each other should not matter. If that parent value happens to be the child in some other record, it would matter to have that other record output before these two.

        Since you have pipe-delimited data, I'm assuming you've extracted the data from the older server into a text file, and you're reading from that file in order to sort it, which makes perfect sense. Of course, my code would need to be adjusted slightly for pipe-delimited input instead of space-delimited:

            while (<DATA>) {
                chomp;
                my ( $c, $p ) = split /\|/;
                ...
            # update: also need to change the print statement in sub trace_down():
                ...
                print "$kid|$$tree{$kid}{child_of}\n";
                ...

        Apart from that, if there's still a problem, you would need to post a minimal demonstration -- a snippet like mine, including sufficient "real-world" data, that exhibits the problem, and perhaps some clarification as to how the actual output differs from the desired output.
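For reference, here is one way the pipe-delimited adjustment might look when wrapped in a function that can be tested on its own. This is a sketch, not the code from the post: `sort_for_load` is a made-up name, it walks the hash with a queue instead of recursion, and it sorts the keys only so the output is repeatable. The third sample record (with parent "A1-910") is invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical wrapper: takes "child|parent" lines, returns them in load order.
sub sort_for_load {
    my @lines = @_;
    my ( %node, @out );

    for my $line (@lines) {
        chomp $line;
        my ( $c, $p ) = split /\|/, $line;
        if ( $c eq $p ) {                    # self-parented: load any time
            push @out, $line;
            next;
        }
        if ( exists $node{$c} && exists $node{$c}{child_of} ) {
            warn "bad record: $c is child of both $p and $node{$c}{child_of}\n";
            next;
        }
        $node{$c}{child_of} = $p;
        $node{$p}{parent_of}{$c} = undef;
    }

    # Start from values that are never a child, then work down level by level.
    my @queue = grep { !exists $node{$_}{child_of} } sort keys %node;
    while ( defined( my $value = shift @queue ) ) {
        my @kids = sort keys %{ $node{$value}{parent_of} || {} };
        push @out, map { "$_|$node{$_}{child_of}" } @kids;
        push @queue, @kids;
    }
    return @out;
}

print "$_\n" for sort_for_load(
    'A1-910-ES-002B-16|A1-910-ES-002B',
    'A1-910-ES-002B-17|A1-910-ES-002B',
    'A1-910-ES-002B|A1-910',                 # invented record for the demo
);
```

The record with "A1-910-ES-002B" as child comes out first, followed by the "-16" and "-17" records, which is the ordering the loader needs.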