audioboxer has asked for the wisdom of the Perl Monks concerning the following question:

Hi Everyone, I'm parsing XML data using xml:libxml and am trying to determine the last category based on "category_id" and "parent_id" but can't figure out the logic to do it through hashes or arrays. The XML looks like this where the last category should be "Carrot"
<products> <product> <categories> <category> <category_id>443120</category_id> <name>Vegetables</name> <parent_id>443026</parent_id> </category> <category> <category_id>445720</category_id> <name>Carrot</name> <parent_id>443120</parent_id> </category> <category> <category_id>443026</category_id> <name>Food</name> </category> </categories> </product> <products>
I've put it into a hash like this but can't figure out how to loop through the hash and determine the last category.
$VAR1 = { 'Vegetables' => { 'id' => '443120', 'parent_id' => '443026' }, 'Food' => { 'parent_id' => '', 'id' => '443026' }, 'Carrot' => { 'parent_id' => '443120', 'id' => '445720' } };
This is my current code
foreach my $part ($dom->findnodes('/products/product')) { foreach my $categories($part->findnodes('./categories/category')) +{ $cat_name = $categories->findvalue("name"); $cat_id = $categories->findvalue("category_id"); $cat_parent_id = $categories->findvalue("parent_id"); $catdata{$cat_name} = {"id" => $cat_id, "parent_id" => $cat_pa +rent_id}; } }

Replies are listed 'Best First'.
Re: Hash or Array - Logic Help
by kcott (Archbishop) on May 30, 2020 at 05:44 UTC

    G'day audioboxer,

    How you intend to use the data should really drive your choices for data structures. Unfortunately, you've only shown how you currently create the data structure but not its usage. Probably the most important thing to note — and you may already know this — is that arrays are ordered and hashes are unordered.

    Other general considerations include the size of the data and whether the data structure ever changes after initial creation. You should also be asking specific questions; such as "Do you only ever want to access the last category or do you perhaps need a sorted list of all categories?".

    Here's a technique for sorting the keys based on "id" values:

    $ perl -E ' my %x = (A => { id => 2 }, B => { id => 3 }, C => { id => 1 }); my @sorted = sort { $x{$a}{id} <=> $x{$b}{id} } keys %x; say "Last: $sorted[-1]"; ' Last: B

    Given it looks like IDs are referenced more than NAMEs, you might consider changing the general layout of the structure to:

    ID => { name => '...', parent_id => ..., }

    I'd also suggest taking a look at the core List::Util module. Some of its functions may prove useful; for example, max (for IDs) and first (with descending sort) seem to be likely candidates.

    — Ken

Re: Hash or Array - Logic Help
by bliako (Abbot) on May 30, 2020 at 13:52 UTC

    The script has been updated (~7 hours after posting) see note below.

    the bottom line is 1) whether your XML content provides element-order information (see for example https://www.w3schools.com/xml/schema_complex_indicators.asp), 2) whether your XML parser understands that element-ordering hint and respects it and 3) whether you are using a Perl data structure which respects order, for example an array.

    If your interpretation of last means the last element in a group, which I guess it is the most widely accepted one, then you lost #3 because you are not using an array. If by last you mean last element when sorted 1) alphabetically keyed on category name or 2) numerically keyed on category id or 3) alphabetically on category name translated to greek, ... N) ... you understand you need to be more specific as there are lots of interpretations.

    If you must use a hash because you need to do fast (O(1)) retrieval by category-name or category-id then you must insert an additional entry into the hash of each category: order containing the order it was parsed. This would be equivalent to the array index.

    If you need to do fast searches by name, id or index then you can use 3 different hashes for that, each keyed on a different attribute. It will be fast but will cost on memory.

    You can have the best of both worlds by using references.

    At this point the penny dropped and I realised that by last you mean the leaf/child category! The ... Carrot! This problem can easily be solved by using an Nary Tree. That is a tree whose nodes can have many children (contrast the Bin-ary tree which can have at most 2 children).

    Below is an example of how to create categories in a hierarchy, insert them into the tree and then find the "last" categories which are leaf nodes of the tree, childless nodes that is. Added bonus: the most tiny and intuitive Nary-Tree module because I could not find what I wanted from CPAN.

    UPDATE: I realised that my example was not covering your use-case, where you want to insert a node to the tree under the node with the specific parent_id. So, I have added a find_node_by_id($id) which will find the parent node and then add the new node as a child. This is demonstrated at the end of the script.

    # by bliako # for https://perlmonks.org/?node_id=11117492 # date: 30/05/2020 # The tiniest and *most intuitive* Nary Tree package # to store a grocery { package Tree::Nary::Tiny; use strict; use warnings; sub new { my $class = $_[0]; my $parent = $_[1]; my $id = $_[2]; my $value = $_[3]; my $print_value_sub = $_[4]; # optional $print_value_sub = sub { return $_[0] } unless defined $print_valu +e_sub; my $self = { '_parent' => $parent, 'id' => $id, 'v' => $value, 'pvs' => $print_value_sub, 'c' => [], '_depth' => 0 }; bless $self => $class; if( defined $parent ){ push @{$parent->{'c'}}, $self; $self->{'_depth'} = $parent->{'_depth'} + 1; } return $self } sub leaf_nodes { my $self = $_[0]; my @ret; for(@{$self->{'c'}}){ my $aret = $_->leaf_nodes(); push(@ret, @$aret) if defined $aret; } return scalar @ret ? \@ret : [$self] } sub parent { return $_[0]->{'_parent'} } sub value { return $_[0]->{'v'} } sub depth { return $_[0]->{'_depth'} } sub id { return $_[0]->{'id'} } sub children { return $_[0]->{'c'} } sub value_stringify { return $_[0]->{'pvs'}->($_[0]) } sub add_child { return Tree::Nary::Tiny->new( $_[0], # parent is our self $_[1], # id $_[2], # optional value/data # optional print value sub or inherit from parent defined($_[3]) ? $_[3] : $_[0]->{'pvs'} ); } sub stringify { my $self = $_[0]; my $v = $self->{'v'}; my $dep = $self->depth(); my $ret = " " x $dep . $self->{'id'} . " (depth $dep) :"; if( defined $v ){ $ret .= $self->{'pvs'}->($self) } else { $ret .= + '<na>' } $ret .= "\n".$_->stringify() for @{$self->children()}; return $ret; } # NEW: added traverse() (depth-first) and find_node_by_id() # traverse all nodes of below the 'self' node (which can be the root n +ode) # param: sub to execute each time a node is visited with parameter the + node # if that sub returns 0 then the traverse is interrupted. sub traverse { my $self = $_[0]; my $func = $_[1]; $func->($self); for my $achild (@{ $self->children() }){ # traverse() can be interrupted if func() returns 0 return(0) unless $achild->traverse($func); } return 1; } # search all the nodes below 'self' (this can be the root node, i.e. t +he whole tree) # for the node with specified id # this is an example of using traverse() sub find_node_by_id { my $self = $_[0]; my $id = $_[1]; my $found; $self->traverse(sub { my $node = $_[0]; # interrupt traverse() if node's id matches if( $id eq $node->id() ){ $found = $node; return 0 } return 1; # not found }); return $found; } 1; } # end package Tree::Nary::Tiny ###### # Main ###### use strict; use warnings; # sub to be called on each node when we want to stringify it # and knows how to handle the user-defined 'value' each node holds. sub node_stringify { my $node = shift; return $node->value()->{'name'} ." id=".$node->value()->{'id'} ." parent=".(defined($node->parent()) ? $node->parent()->id() : "< +noparent>") } # create the tree my $T = Tree::Nary::Tiny->new( undef, # no parent 0, # id {name=>"i am the tree holding your categories",id=>0}, # data \&node_stringify # sub to print each node ); my ($cat0, $cat1, $cat2, $node0, $node1, $node2); for my $i (1..2){ $cat0 = "$i"; # add this set of nodes to the root node of the tree, they are sup +er-categories # @audioboxer 'Food', 'Drink', etc. $node0 = Tree::Nary::Tiny->new($T, $cat0, {name=>"my id is $cat0 - + at level 0", id => $cat0}, \&node_stringify) or die "node0:"; for my $j (1..2){ $cat1 = "$i.$j"; # 2nd level nodes, @audioboxer 'Vegetables', 'Fizzy' etc. $node1 = Tree::Nary::Tiny->new($node0, $cat1, {name=>"my id is + $cat1 - at level 1", id => $cat1}, \&node_stringify) or die "node1"; for my $k (1..2){ $cat2 = "$i.$j.$k"; # 3rd level nodes, @audioboxer 'Carrot', 'Sodaxyz' $node2 = Tree::Nary::Tiny->new($node1, $cat2, {name=>"my i +d is $cat2 - at level 2", id => $cat2}, \&node_stringify) or die "nod +e2:"; } } } # print the tree and love recursion print $T->stringify()."\n"; print qq{\n\nThe """last""" category is....\n}; # find all the "last" categories, which are the leaf nodes, # childless nodes. my $leafs = $T->leaf_nodes(); print $_->stringify()."\n" for @$leafs; # NEW: # insert a new node under existing node (find by id) my $newid = '2.1.20'; print "Inserting a new node with id $newid, under '2.1'...\n"; my $nodefound = $T->find_node_by_id('2.1'); $nodefound->add_child($newid, {name=>"new inserted node", id=>$newid}) +; print $T->stringify()."\n";

    bw, bliako

      Very interesting:

      $ ./1.1.nary.pl 0 (depth 0) :i am the tree holding your categories id=0 parent=<nopare +nt> 1 (depth 1) :my id is 1 - at level 0 id=1 parent=0 1.1 (depth 2) :my id is 1.1 - at level 1 id=1.1 parent=1 1.1.1 (depth 3) :my id is 1.1.1 - at level 2 id=1.1.1 parent= +1.1 1.1.2 (depth 3) :my id is 1.1.2 - at level 2 id=1.1.2 parent= +1.1 1.2 (depth 2) :my id is 1.2 - at level 1 id=1.2 parent=1 1.2.1 (depth 3) :my id is 1.2.1 - at level 2 id=1.2.1 parent= +1.2 1.2.2 (depth 3) :my id is 1.2.2 - at level 2 id=1.2.2 parent= +1.2 2 (depth 1) :my id is 2 - at level 0 id=2 parent=0 2.1 (depth 2) :my id is 2.1 - at level 1 id=2.1 parent=2 2.1.1 (depth 3) :my id is 2.1.1 - at level 2 id=2.1.1 parent= +2.1 2.1.2 (depth 3) :my id is 2.1.2 - at level 2 id=2.1.2 parent= +2.1 2.2 (depth 2) :my id is 2.2 - at level 1 id=2.2 parent=2 2.2.1 (depth 3) :my id is 2.2.1 - at level 2 id=2.2.1 parent= +2.2 2.2.2 (depth 3) :my id is 2.2.2 - at level 2 id=2.2.2 parent= +2.2 The """last""" category is.... 1.1.1 (depth 3) :my id is 1.1.1 - at level 2 id=1.1.1 parent= +1.1 1.1.2 (depth 3) :my id is 1.1.2 - at level 2 id=1.1.2 parent= +1.1 1.2.1 (depth 3) :my id is 1.2.1 - at level 2 id=1.2.1 parent= +1.2 1.2.2 (depth 3) :my id is 1.2.2 - at level 2 id=1.2.2 parent= +1.2 2.1.1 (depth 3) :my id is 2.1.1 - at level 2 id=2.1.1 parent= +2.1 2.1.2 (depth 3) :my id is 2.1.2 - at level 2 id=2.1.2 parent= +2.1 2.2.1 (depth 3) :my id is 2.2.1 - at level 2 id=2.2.1 parent= +2.2 2.2.2 (depth 3) :my id is 2.2.2 - at level 2 id=2.2.2 parent= +2.2 Inserting a new node with id 2.1.20, under '2.1'... 0 (depth 0) :i am the tree holding your categories id=0 parent=<nopare +nt> 1 (depth 1) :my id is 1 - at level 0 id=1 parent=0 1.1 (depth 2) :my id is 1.1 - at level 1 id=1.1 parent=1 1.1.1 (depth 3) :my id is 1.1.1 - at level 2 id=1.1.1 parent= +1.1 1.1.2 (depth 3) :my id is 1.1.2 - at level 2 id=1.1.2 parent= +1.1 1.2 (depth 2) :my id is 1.2 - at level 1 id=1.2 parent=1 1.2.1 (depth 3) :my id is 1.2.1 - at level 2 id=1.2.1 parent= +1.2 1.2.2 (depth 3) :my id is 1.2.2 - at level 2 id=1.2.2 parent= +1.2 2 (depth 1) :my id is 2 - at level 0 id=2 parent=0 2.1 (depth 2) :my id is 2.1 - at level 1 id=2.1 parent=2 2.1.1 (depth 3) :my id is 2.1.1 - at level 2 id=2.1.1 parent= +2.1 2.1.2 (depth 3) :my id is 2.1.2 - at level 2 id=2.1.2 parent= +2.1 2.1.20 (depth 3) :new inserted node id=2.1.20 parent=2.1 2.2 (depth 2) :my id is 2.2 - at level 1 id=2.2 parent=2 2.2.1 (depth 3) :my id is 2.2.1 - at level 2 id=2.2.1 parent= +2.2 2.2.2 (depth 3) :my id is 2.2.2 - at level 2 id=2.2.2 parent= +2.2 $
Re: Hash or Array - Logic Help
by haukex (Archbishop) on May 30, 2020 at 19:51 UTC

    Here's how I might have coded this. The trick here is to remember that when you build deeply nested data structures, each element is a reference to an anonymous hash or array, and you can keep multiple references to the same anonymous data structure (perlreftut, perldsc, perlref). So, I additionally keep references to each node in the temporary structure %nodes so I can easily access them by ID, no matter where they are in the tree. I do make the assumption that the data in the XML is structured correctly: that IDs are unique, each parent_id references an existing ID, that there are no duplicate IDs, that there are no stray nodes, no cycles, and so on.

    use warnings; use strict; use XML::LibXML; my $dom = XML::LibXML->load_xml({ location => '11117492.xml' }); # Build the tree my (%nodes,%tree); for my $part ($dom->findnodes('/products/product')) { for my $categories ($part->findnodes('./categories/category')) { my $cat_name = $categories->findvalue("name"); my $cat_id = $categories->findvalue("category_id"); my $cat_parent_id = $categories->findvalue("parent_id"); $nodes{$cat_id}{id} = $cat_id; $nodes{$cat_id}{name} = $cat_name; if ( length $cat_parent_id ) { push @{$nodes{$cat_parent_id}{children}}, $nodes{$cat_id}; } else { $tree{$cat_id} = $nodes{$cat_id} } # root node } } use Data::Dump; dd \%tree; # Debug # Walk the tree to find leaves my @queue = values %tree; while ( @queue ) { my $elem = shift @queue; if ( $elem->{children} ) { push @queue, @{$elem->{children}}; } else { print $elem->{name}," (",$elem->{id},")\n"; } }

    Output:

    { 443026 => { children => [ { children => [{ id => 445720, name => "Carrot" }], id => 443120, name => "Vegetables", }, ], id => 443026, name => "Food", }, } Carrot (445720)

    Update: Expanded explanation slightly.

Re: Hash or Array - Logic Help
by Anonymous Monk on May 30, 2020 at 10:05 UTC

    Hi Everyone, I'm parsing XML data using xml:libxml and am trying to determine the last category based on "category_id" and "parent_id" but can't figure out the logic to do it through hashes or arrays. The XML looks like this where the last category should be "Carrot"

    Hi

    So why should carrot be last?

    No xml is required for that logic ;)

    #!/usr/bin/perl -- use strict; use warnings; use Data::Dump qw/ dd /; use constant +{ qw{ CAT_ID 0 NAME 1 PARENT_ID 2 } }; Main( @ARGV ); exit( 0 ); sub last_category(;+) { ( sort { $a->[CAT_ID] <=> $b->[CAT_ID] } @{ $_[0] } )[ -1 ] } sub Main { my @cats = ( [443120, "Vegetables", 443026], [445720, "Carrot", 443120], [443026, "Food", undef ], ); dd \@cats; dd( last_category( \@cats ) ); @cats = sort { $a->[CAT_ID] <=> $b->[CAT_ID] } @cats; dd\@cats; } __END__ [ [443120, "Vegetables", 443026], [445720, "Carrot", 443120], [443026, "Food", undef], ] [445720, "Carrot", 443120] [ [443026, "Food", undef], [443120, "Vegetables", 443026], [445720, "Carrot", 443120], ]

    Although once you know the goal ... xsh probably implements a max/min function

    Yup https://xsh.sourceforge.io/doc2/html/index.html#max_function

    $ xsh -q xpath-max-min-11117492-correct.xsh <category_id>443120 </category_id> <category_id>445720 </category_id> <category_id>443026 </category_id> 443120 445720 443026 max: 445720 min: 443026

    $ cat xpath-max-min-11117492-correct.xsh open "xpath-max-min-11117492.xml"; ls //category_id; ls //category_id/node(); echo max: xsh:max( //category_id ); echo min: xsh:min( //category_id );
      Note that the sourceforge repo and documentation are no longer up to date. You can find the recent changes at GitHub.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Where is the page I linked?
A reply falls below the community's threshold of quality. You may see it by logging in.