in reply to Re^4: Perl modules or standard tools for searching hierachical data
in thread Perl modules or standard tools for searching hierachical data

Please put "code" tags around source code and sample data, like this:

<code>
code or data here...
</code>

or like this:

<c>
code or data here...
</c>

When you preview your post, you'll find links to formatting tips for PM.

Now, as to your sample data: you didn't do what I asked. You posted one set of input, and a completely separate set of output. The two sets seem to have nothing in common apart from the formatting, and your closing comment doesn't make sense:

I was hoping that the number "A1-4100|A1" would be fairly close to the top of the "sorted" list as "A1-4100" is the parent for a ton of numbers, however, it appears at line 1705 (...)
Huh? Why would you care about the specific line number where a given record actually shows up in the output? All that matters is that all dependent records come after it.

If that is not the case for a given run, create a small subset of the data to serve as a test set, run that through the script, and if you can show that the ordering is incorrect, post just that (hopefully small) minimal set of records that demonstrate the error.

Note that the actual ordering of output from my original code should be: the first top-level parent, then all records dependent on that, then the next top-level parent, then all records dependent on that, and so on.

(Update: note that the notion of "first top-level parent" is arbitrary; the "keys" function returns hash keys in essentially random order, so if you want to see top-level parents handled in a specific order, you'll need to sort them before they go into the grep function.)
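For what it's worth, here is a minimal sketch of that sorting idea. It is not the original script: the helper names build_tree and ordered_records are made up for illustration, trace_down collects lines into an array instead of printing, and sorting the child keys inside trace_down is an extra assumption on my part so that the whole output is repeatable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the child_of / parent_of structure from "child|parent" records.
# (build_tree is a hypothetical helper, not part of the original script.)
sub build_tree {
    my @records = @_;
    my %node;
    for my $rec (@records) {
        my ( $c, $p ) = split /\|/, $rec;
        $node{$c}{child_of} = $p;
        $node{$p}{parent_of}{$c} = undef;
    }
    return \%node;
}

# Depth-first walk that appends "child|parent" lines to @$out.
# Sorting the child keys here is an addition to keep runs repeatable.
sub trace_down {
    my ( $kids, $tree, $out ) = @_;
    for my $kid ( sort keys %$kids ) {
        push @$out, "$kid|$$tree{$kid}{child_of}";
        if ( exists( $$tree{$kid}{parent_of} ) ) {
            trace_down( $$tree{$kid}{parent_of}, $tree, $out );
        }
    }
}

# Hypothetical wrapper around the main "for" loop.
sub ordered_records {
    my ($tree) = @_;
    my @out;
    # Sort the keys before they go into grep, so top-level parents
    # are always visited in the same order.
    for my $parent ( grep { !exists( $$tree{$_}{child_of} ) } sort keys %$tree ) {
        trace_down( $$tree{$parent}{parent_of}, $tree, \@out );
    }
    return @out;
}

print "$_\n" for ordered_records( build_tree( qw(
    A1-4100-YZX-002|A1-4100
    A1-4100|A1
    A1-4200-ABC-001|A1-4200
    A1-4200|A1
    A1-4100-YZX-002-01|A1-4100-YZX-002
) ) );
```

With plain string sorting, the A1-4100 branch is always emitted before the A1-4200 branch, and every parent still comes before its dependents.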

Re^6: Perl modules or standard tools for searching hierachical data
by SlackBladder (Novice) on Mar 14, 2007 at 11:16 UTC
    Apologies for the previous "long" message..
    The reason I care about the order of the line numbers is that this is the order that the data will be loaded into the new RDBMS. A query is going to loop through the file to obtain result sets based on the values in this file. The target RDBMS (SAP) is configured so that if there is an attempt to load a child object before its parent then the whole import fails. So, as an example, if you have the following records in a source file, where the record on the left is the child and the one on the right is the parent...
    A1-4100-YZX-002|A1-4100
    A1-4100|A1
    A1-4200-ABC-001|A1-4200
    A1-4200|A1
    A1-4100-YZX-002-01|A1-4100-YZX-002

    In this order the data will not load. What I am trying to do is re-order the same data as....
    A1|TOP
    A1-4100|A1
    A1-4100-YZX-002|A1-4100
    A1-4100-YZX-002-01|A1-4100-YZX-002
    A1-4200|A1
    A1-4200-ABC-001|A1-4200

    This would also load....
    A1|TOP
    A1-4100|A1
    A1-4200|A1
    A1-4100-YZX-002|A1-4100
    A1-4100-YZX-002-01|A1-4100-YZX-002
    A1-4200-ABC-001|A1-4200
    I don't need to worry about the very, very top number "A1" as this is pre-configured in the target database as the highest-on-high; everything else needs to follow below.

    I think that the problem may (and I stress "may") be that within the data, there are "child, father, grandfather" relationships, however, your last statement about the "random" order of the returned hash keys and returning the result set in a specific order by sorting "before" the grep function may be the "key" (sorry for the pun on the word key, couldn't help it)
    regards
    SlackyB
      Hello,
      I forgot to post the code I am using
      #!/usr/bin/perl
      use strict;
      use warnings;

      my %node;
      my $folder = "P:\\Prod-Operations\\Maint-Eng\\Maintenance Projects\\INF\\DB\\IFDB\\TAGHIERARCHY\\PerlScripts";
      my $resultfile = "resultset.txt";
      my $interset01 = "interset01.txt";
      my $interset02 = "interset02.txt";

      open (DATA, "< $folder\\$resultfile") || die "could not open file: $!";
      while (<DATA>) {
          my ( $c, $p ) = split /\|/;
          if ( $c eq $p ) {  # these are easy, so finish them first
              print;
              next;
          }
          if ( exists( $node{$c}{child_of} )) {
              warn "$.: bad record: $c is child of both $p and $node{$c}{child_of}\n";
              next;
          }
          $node{$c}{child_of} = $p;
          $node{$p}{parent_of}{$c} = undef;
      }

      # begin the sorted output by looping over values that do not have parents:
      # open (INT01,">$folder\\$interset01") or die "Can not open file $folder\\$interset01 for writing, quitting\n";
      for my $parent ( grep {!exists( $node{$_}{child_of} ) } keys %node ) {
          my $children = $node{$parent}{parent_of};  # ref to hash of child values
          trace_down( $children, \%node );
      }

      sub trace_down {
          my ( $kids, $tree ) = @_;
          for my $kid ( keys %$kids ) {
              # print INT01 "$kid|$$tree{$kid}{child_of}";
              print "$kid|$$tree{$kid}{child_of}\n";
              if ( exists( $$tree{$kid}{parent_of} )) {
                  trace_down( $$tree{$kid}{parent_of}, $tree );
              }
          }
      }
      Looking at the code you just posted (thanks for that), I think the problem is that you left out the "chomp;" when reading the records in from the input file. Why did you leave that out? It's pretty important, because without that, every "parent" string will include the "\r\n" ("CRLF") line-termination characters, and will therefore never match a "child" string (because the child never includes "\r\n" at the end). Please try adding "chomp;" as the first line of the while loop when reading in the data, and see if that helps.
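      To make the line-ending point concrete, here is a small sketch. The helper name parse_record is hypothetical, and the substitution is a variation on plain "chomp;" (an assumption on my part): chomp only removes the "\n" by default, so if a CRLF file is read somewhere that does not translate line endings, a stray "\r" could still be left on the parent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one "child|parent" record after removing a trailing LF or CRLF.
# (parse_record is a hypothetical helper, not part of the original script.)
sub parse_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # strips "\n" and also "\r\n"
    return split /\|/, $line;
}

my $raw = "A1-4100|A1\r\n";

# Without stripping, the line ending stays glued to the parent field.
my ( undef, $unstripped ) = split /\|/, $raw;
my ( undef, $stripped )   = parse_record($raw);

printf "without stripping: parent is %d characters long\n", length $unstripped;
printf "with stripping:    parent is %d characters long\n", length $stripped;
```

      The first line reports 4 characters ("A1" plus CR and LF) and the second reports 2, which is why the unstripped parent can never be equal to the child key "A1".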

      Apart from that, when I put your five lines of sample data (thanks for that) into my original code, it came out in this order:

      A1-4100|A1
      A1-4100-YZX-002|A1-4100
      A1-4100-YZX-002-01|A1-4100-YZX-002
      A1-4200|A1
      A1-4200-ABC-001|A1-4200
      which is just like the one you said would be okay. It might also come out as:
      A1-4200|A1
      A1-4200-ABC-001|A1-4200
      A1-4100|A1
      A1-4100-YZX-002|A1-4100
      A1-4100-YZX-002-01|A1-4100-YZX-002
      and that should also be acceptable. If you got something other than those two possible outputs, it's probably because you forgot to "chomp;" the input, or maybe there are spurious other whitespace characters that you weren't aware of.

      As for the initial "TOP" record, are you sure you have that worked out fully? Would it be the case that every top-level parent (identified as such in my script) needs its own "top-level-string|TOP" record? If so, it would be easy to modify the code to make sure this is done for each of the top-level parents -- just add a print statement  print "$parent|TOP\n"; as the first thing in the main "for" loop.

      I went ahead and did that on my own copy of the script, and ran it on the huge input sample that you posted above; adding the "TOP" lines like that actually makes it easier to inspect the output for correctness, as follows:

      Use the unix "grep -n" command (there are perl versions posted at the Monastery and elsewhere -- e.g. here's mine: grepp -- Perl version of grep) to get the line numbers containing "TOP". Then, for any of those "top-level children", check the line numbers containing "\|top-level-string$". All the latter line numbers should be higher/later in the file than the corresponding TOP line.
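      If you would rather not eyeball line numbers, the same check can be sketched in Perl. This is only an illustration, not something from my script: the sub name misordered is made up, and it assumes each record is a chomped "child|parent" string. Anything you pass after the array reference ("TOP", or "A1" if that is pre-loaded in the target) is treated as already present.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the records whose parent has not appeared as a child on an
# earlier line. Names in @known are treated as already loaded.
# (misordered is a hypothetical helper name.)
sub misordered {
    my ( $records, @known ) = @_;
    my %seen = map { $_ => 1 } @known;
    my @bad;
    for my $rec (@$records) {
        my ( $child, $parent ) = split /\|/, $rec;
        push @bad, $rec unless $seen{$parent};
        $seen{$child} = 1;
    }
    return @bad;
}

# The five sample records in their original, unsorted order.
my @unsorted = qw(
    A1-4100-YZX-002|A1-4100
    A1-4100|A1
    A1-4200-ABC-001|A1-4200
    A1-4200|A1
    A1-4100-YZX-002-01|A1-4100-YZX-002
);
print "loaded too early: $_\n" for misordered( \@unsorted, 'A1' );
```

      On the unsorted sample this flags A1-4100-YZX-002|A1-4100 and A1-4200-ABC-001|A1-4200; on correctly ordered output it should print nothing.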

        Hello
        I am now beginning to wonder at my own sanity as when I run your original code, replacing your data with the following
        A1-4100-YZX-002|A1-4100
        A1-4100|A1
        A1-4200-ABC-001|A1-4200
        A1-4200|A1
        A1-4100-YZX-002-01|A1-4100-YZX-002
        I get the following output
        A1-4100|A1
        A1-4200-ABC-001|A1-4200
        A1-4100-YZX-002-01|A1-4100-YZX-002
        A1-4100-YZX-002|A1-4100
        A1-4200|A1
        Which is very different to what you're seeing. The full code I am running is:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my %node;
        while (<DATA>) {
            my ( $c, $p ) = split;
            if ( $c eq $p ) {  # these are easy, so finish them first
                print;
                next;
            }
            if ( exists( $node{$c}{child_of} )) {
                warn "$.: bad record: $c is child of both $p and $node{$c}{child_of}\n";
                next;
            }
            $node{$c}{child_of} = $p;
            $node{$p}{parent_of}{$c} = undef;
        }

        # begin the sorted output by looping over values that do not have parents:
        for my $parent ( grep { !exists( $node{$_}{child_of} ) } keys %node ) {
            my $children = $node{$parent}{parent_of};  # ref to hash of child values
            trace_down( $children, \%node );
        }

        sub trace_down {
            my ( $kids, $tree ) = @_;
            for my $kid ( keys %$kids ) {
                print "$kid $$tree{$kid}{child_of}\n";
                if ( exists( $$tree{$kid}{parent_of} )) {
                    trace_down( $$tree{$kid}{parent_of}, $tree );
                }
            }
        }
        __DATA__
        A1-4100-YZX-002|A1-4100
        A1-4100|A1
        A1-4200-ABC-001|A1-4200
        A1-4200|A1
        A1-4100-YZX-002-01|A1-4100-YZX-002
        On the question of multiple "TOPS".... There will only ever be one "TOP" which is A1. The reason for this is that there are around 12 (I can't remember the exact number) "children" of A1 which are A1-4100, A1-4200, A1-4300 etc etc etc. Then you have (for example) "A1-4200-ABC-001". I already tried using "grep -w" (match only whole words) and using "sort" (I worked for a number of years on UNIX and still work with Linux) which I thought may fix things, however, the parent-to-child relationship does not follow strict numbering conventions, i.e. the sequence "A1-4200-ABC-001" may be a child of "A1-405-ABC-001-FF".
        The top two tiers of the hierarchy follow a numbering convention (A1, A1-4600 etc) and can be followed quite easily, however, the rest has been built by a person and not by any "logical" numbering system so using partial string matching will not work, unfortunately.

        I realize I am taking up a lot of your time with this and I do appreciate your input and patience with my coding inadequacies.
        SlackyB