Ionizor has asked for the wisdom of the Perl Monks concerning the following question:

So my situation is this: I'm trying to parse an XML file using XML::Parser. The English explanation is rather bizarre and convoluted, but here it is: I'm trying to build a hash keyed by element name. Each value is an array with one item per element encountered. Each item is a hash with two keys, text and attributes: the text value holds the character data, and the attributes value holds an anonymous hash of that element's attribute name/value pairs.

To make that a little clearer, the data for an element that looks like this:
<requirement contactname="Joe Average">A power cord.</requirement>
would be like this:

$tagstack{"requirements"}[0] = { "text" => "A Power Cord", "attributes" => ["contactname"=>"Joe Average"], }


If you came across another <requirement> element like:
<requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>
it would get added to the structure like this:

$tagstack{"requirements"}[1] = { "text" => "A node name", "attributes" => ["contactname"=>"Jane Smith", "contactnumber" => "555-1212"], }

I'm thinking that what I should be doing is predefining my data structure but I'm not entirely sure how to do that without assigning values to the data structure when I start. I'd also prefer to be able to push/pop or shift/unshift the array instead of assigning it manually in a loop.

I've been searching through the Camel book (3rd Ed.) but haven't found what I'm looking for. Can anyone give me a page number or a quick tutorial? Help much appreciated. Thanks.

Replies are listed 'Best First'.
Re: Predefining complex data structures?
by broquaint (Abbot) on Jul 12, 2002 at 14:46 UTC
    You don't need to predefine your data structure, as it will be created as you insert the data, which you could do like so:
    my %tagstack;
    # ... parsing code here
    push @{$tagstack{requirements}}, {
        # $tag = wherever the data is coming from
        text       => $tag{cdata},
        attributes => $tag{attribs}
    };
    If you're just using simple nested hashes then XML::Simple may be just the module for you, and if you continue to get muddled by references and reference syntax check out the perlreftut and perlref manpages.
    HTH

    _________
    broquaint

      Thanks! This was extremely helpful.
Re: Predefining complex data structures?
by dragonchild (Archbishop) on Jul 12, 2002 at 14:50 UTC
    First, don't pre-define. Let Perl do the auto-vivification for you. That's what it's there for. Especially because you don't know for certain what will be there, just what structure it will be in.
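A minimal sketch of what autovivification looks like in practice. The `@parsed` list is a hypothetical stand-in for whatever the parser hands back; it is not part of XML::Parser.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parser output: [ element name, text, attribute hash ]
my @parsed = (
    [ 'requirement', 'A power cord.', { contactname => 'Joe Average' } ],
    [ 'requirement', 'A node name',   { contactname => 'Jane Smith', contactnumber => '555-1212' } ],
);

my %tagstack;    # starts empty; nothing is declared about its shape

for my $item (@parsed) {
    my ($name, $text, $attrs) = @$item;
    # The first push for a given $name creates the array reference on demand.
    push @{ $tagstack{$name} }, { text => $text, attributes => $attrs };
}

print scalar @{ $tagstack{requirement} }, "\n";
print $tagstack{requirement}[1]{attributes}{contactnumber}, "\n";
```

The hash never needs a declaration of its nested shape; the array under each key comes into existence at the first push.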

    Secondly, you want to do something along the lines of:

    # Caveat Lector - this is untested!
    foreach $tag (@tags) {
        push @{$tagstack{$tag->name}}, {
            text => $tag->value,
            attributes => { split /[\s=]+/, $tag->attributes },
        };
    }
    Obviously, I'm assuming that $tag is some object with three methods - name(), value(), and attributes(). These will have to return the appropriate values from the data source. Also, I'm assuming that attributes() will return everything within the tag definition other than the tag's name. If all you get is the text, you could do something like:
    foreach my $tag (@tags) {
        my ($name, $attributes, $value) = $tag =~ m#^<(\w+)\s+(\w+)?\s*>(.*?)</\1>$#;
        # Use $name, $attributes, and $value as per above
    }
    The regex could be tightened, but this is off-the-cuff.
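One way the regex might be tightened, as a rough sketch only. It assumes a single element on one line, no nested elements, and double-quoted attribute values; anything beyond that is better left to a real XML parser. The sample `$tag` string is taken from the question.

```perl
use strict;
use warnings;

my $tag = '<requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>';

# Capture the element name, the raw attribute string, and the text.
my ($name, $attr_string, $value) =
    $tag =~ m{\A<(\w+)\s*([^>]*)>(.*?)</\1>\z}s;

# Each name="value" pair yields two captures, so the list of matches
# can be assigned straight into a hash. This also keeps values that
# contain spaces intact, which a plain split on whitespace would not.
my %attributes = $attr_string =~ /(\w+)\s*=\s*"([^"]*)"/g;

print "$name: $value\n";
print "$_ => $attributes{$_}\n" for sort keys %attributes;
```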

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: Predefining complex data structures?
by thraxil (Prior) on Jul 12, 2002 at 14:54 UTC

    first, i would change the data structure to:

    $tagstack{requirements}->[1] = { text => "A node name", attributes => {contactname => "Jane Smith", contactnumber => "555-1212" } }

    note that it is now a hash containing a reference to an array which contains hashrefs. also note the curly braces around the attributes hash instead of square brackets.
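A small sketch of the difference between the two bracket styles, and of what the arrow in the assignment does. The variable names `$as_array` and `$as_hash` are made up for illustration.

```perl
use strict;
use warnings;

# Square brackets build an array reference; "=>" acts only as a comma here.
my $as_array = [ contactname => 'Jane Smith', contactnumber => '555-1212' ];

# Curly braces build a hash reference, so values can be looked up by key.
my $as_hash  = { contactname => 'Jane Smith', contactnumber => '555-1212' };

print scalar(@$as_array), "\n";          # four elements: key, value, key, value
print $as_hash->{contactnumber}, "\n";

my %tagstack;
$tagstack{requirements}->[1] = { text => 'A node name', attributes => $as_hash };

# Between adjacent subscripts the arrow is optional, so both spellings
# reach the same element.
print "same element\n"
    if $tagstack{requirements}->[1] == $tagstack{requirements}[1];
```

The arrow dereferences the reference stored in `$tagstack{requirements}`; Perl lets you drop it between two subscripts, which is why both forms appear in this thread.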

    if XML::Simple is up to the task of parsing your XML, this is fairly straightforward. here's a little script showing how you would go about it:

    #!/usr/bin/perl -wT
    use strict;
    use XML::Simple;
    use Data::Dumper;

    my $data = XML::Simple::XMLin('./test.xml');

    # show what we start with
    print Data::Dumper::Dumper($data);

    my %tagstack;
    my @temp;
    foreach my $h (@{$data->{requirement}}) {
        my %t;
        $t{text} = $h->{content};
        delete $h->{content};
        $t{attributes} = $h;
        push @temp, \%t;
    }
    $tagstack{requirement} = \@temp;

    # show the finished product
    print Data::Dumper::Dumper(\%tagstack);

    with test.xml being:

    <root>
      <requirement contactname="Joe Average">A power cord.</requirement>
      <requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>
    </root>

    it gives the following output:

    $VAR1 = {
      'requirement' => [
        {
          'contactname' => 'Joe Average',
          'content' => 'A power cord.'
        },
        {
          'contactnumber' => '555-1212',
          'contactname' => 'Jane Smith',
          'content' => 'A node name'
        }
      ]
    };

    $VAR1 = {
      'requirement' => [
        {
          'text' => 'A power cord.',
          'attributes' => {
            'contactname' => 'Joe Average'
          }
        },
        {
          'text' => 'A node name',
          'attributes' => {
            'contactnumber' => '555-1212',
            'contactname' => 'Jane Smith'
          }
        }
      ]
    };

    personally, i think the data structure that XML::Simple produces is more intuitive, but you've probably got a reason for wanting it in the format you do.

    anders pearson

      Okay, I've looked over the changes you suggested to the data structure. The only thing I wasn't quite clear on was what the -> arrow operator at the beginning does. I haven't seen it used in that particular way in Perl before. Admittedly I'm still less than 200 pages into the Camel book.

      Unfortunately, though it is more intuitive, XML::Simple isn't quite enough for what I need. Reassembling the data in the <method> structure I posted in another part of this thread would be more complicated than just sticking with XML::Parser. With XML::Parser I can use an if or a case to fire off different code for an <object> or <input> element, so that I can apply formatting (that's all the object and input tags are for) without having to reassemble the strings.

      Thanks!

(jeffa) Re: Predefining complex data structures?
by jeffa (Bishop) on Jul 12, 2002 at 14:54 UTC
    I can't think of a single good reason to do this, as the data structure returned by XML::Simple should work for just about any need you have:
    $VAR1 = {
      'requirement' => [
        {
          'contactname' => 'Joe Average',
          'content' => 'A power cord.'
        },
        {
          'contactnumber' => '555-1212',
          'contactname' => 'Jane Smith',
          'content' => 'A node name'
        }
      ]
    };
    But, since you asked, how about this:
    use strict;
    use XML::Simple;

    my $data = do {local $/;<DATA>};
    my $xml = XMLin($data,forcearray=>1);
    my $new;

    for my $req (@{$xml->{requirement}}) {
        my %temp;
        $temp{text} = delete $req->{content};
        $temp{attributes} = [%$req];
        push @{$new->{requirements}}, {%temp};
    }

    __DATA__
    <xml>
    <requirement contactname="Joe Average">A power cord.</requirement>
    <requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>
    </xml>
    This produced the following data structure for me:
    $VAR1 = {
      'requirements' => [
        {
          'text' => 'A power cord.',
          'attributes' => [
            'contactname',
            'Joe Average'
          ]
        },
        {
          'text' => 'A node name',
          'attributes' => [
            'contactnumber',
            '555-1212',
            'contactname',
            'Jane Smith'
          ]
        }
      ]
    };
    UPDATE:
    Woah! Sorry Ionizor, i read your post and parsed XML::Parser as XML::Simple. Forgive me, that's what i get for trying to answer questions in the morning without the prerequisite cup 'o joe first. :) But yes, XML::Simple will parse that XML snippet you provided:
    $VAR1 = {
      'method' => [
        {
          'object' => [
            'Properties',
            'Do not use option Foo',
            'Server Name',
            'OK'
          ],
          'content' => [
            'Open up the ',
            ' page. Then uncheck the ',
            ' checkbox. Under ',
            ' enter ',
            ' and then hit '
          ],
          'input' => [
            'www.example.com'
          ]
        }
      ]
    };
    But that probably will not work for you. :(

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      At the moment I'm using XML::Parser because I wasn't sure if XML::Simple would correctly handle things like:

      <method>Open up the <object>Properties</object> page. Then uncheck the <object>Do not use option Foo</object> checkbox. Under <object>Server Name</object> enter <input>www.example.com</input> and then hit <object>OK</object></method>

      Which I will be processing a little later on in the script.
Re: Predefining complex data structures?
by demerphq (Chancellor) on Jul 12, 2002 at 15:00 UTC
    First off, you can't "predefine" your data structure in Perl. There is no means by which you can explicitly specify its shape ahead of time. Instead, Perl provides easy ways to define your data structure implicitly, as well as to interact with it and redefine it on the fly.

    So how to do some of the things you want to do...

    my %struct;
    foreach my $elem (@elements) { # loop over all the elements
        # check to make sure our hash key has an array we can push onto.
        $struct{$elem->name}=[] unless $struct{$elem->name};
        # now create a new sub hash to push onto the array later
        my %hash=(text=>$elem->text, attributes=>[]); #initialize it
        # loop over each attribute in the element
        foreach my $attrib ($elem->attribs) {
            # push the elements onto the attributes array
            push @{$hash{attributes}},$attrib->name,$attrib->value;
        }
        # push a reference to the newly created hash on the array stored for this element type.
        push @{$struct{$elem->name}},\%hash;
    }
    Now of course you will have to figure out how to convert the pseudo methods I've used here into the real thing. Also, iirc XML does not allow two attributes of the same name in one tag, so instead of using an array to store them just use a hash (unless order is important).
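For what the "real thing" might look like with XML::Parser, here is a hedged sketch rather than a definitive implementation. It assumes XML::Parser is installed from CPAN, stores attributes in a hash as suggested, and uses an `@open` stack (a name made up here) to know which element incoming text belongs to.

```perl
use strict;
use warnings;
use XML::Parser;

my %struct;    # element name => array of { text, attributes }
my @open;      # records for the elements that are currently open

my $parser = XML::Parser->new(
    Handlers => {
        # Start receives the expat object, the element name, and then
        # the attribute names and values as a flat list.
        Start => sub {
            my ($expat, $name, %attrs) = @_;
            my $record = { text => '', attributes => \%attrs };
            push @{ $struct{$name} }, $record;
            push @open, $record;
        },
        # Char can fire more than once per element, so append.
        Char => sub {
            my ($expat, $string) = @_;
            $open[-1]{text} .= $string if @open;
        },
        End => sub { pop @open },
    },
);

$parser->parse(<<'XML');
<root>
<requirement contactname="Joe Average">A power cord.</requirement>
<requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>
</root>
XML

print $struct{requirement}[1]{attributes}{contactnumber}, "\n";
```

Because text is appended to whichever element is on top of the stack, mixed content is split between the parent and its children rather than kept in document order; handling that would need extra bookkeeping.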

    HTH

    UPDATE The line where I put an array in explicitly is not needed in this scenario, but if we code like

    push @{$hash{$key}},'var' unless @{$hash{$key}}>5;
    we would need it, because autovivification doesn't happen in that context. Sorry. And just now in the CB chip pointed out that changing the condition to
    push @{$hash{$key}},'var' unless @{$hash{$key}||[]}>5;
    would also do the trick, and is probably more elegant, if a touch obfu'd. Thanks chip.
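A short sketch of why the guard matters, assuming `use strict` is in effect. The key name `colors` and the loop of ten pushes are arbitrary choices for the demonstration, as is the cap of 5 in the original condition.

```perl
use strict;
use warnings;

my %hash;
my $key = 'colors';

# Counting elements is an rvalue use, so nothing is autovivified;
# under "use strict" dereferencing the missing entry dies instead.
my $survived = eval { my $count = @{ $hash{$key} }; 1 };
print "unguarded count died\n" unless $survived;

# With the "|| []" guard an empty anonymous array stands in for the
# missing entry, and push then creates the real one.
for my $n (1 .. 10) {
    push @{ $hash{$key} }, "var$n" unless @{ $hash{$key} || [] } > 5;
}

# Pushes stop once the array holds more than five elements.
print scalar @{ $hash{$key} }, "\n";
```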

    Yves / DeMerphq
    ---
    Writing a good benchmark isnt as easy as it might look.

      Thanks for the nod, demerphq, but I think I can do my hack one better:

      for ($hash{$key}) { push @$_, 'var' unless @$_ > 5 }

          -- Chip Salzenberg, Free-Floating Agent of Chaos

        I understand the rest of the snippet but I'm missing the significance of the 5. Why 5?
      Heh... I put square brackets instead of braces in my example. I did mean for the attributes to be in a hash rather than an array. Oops!
Re: Predefining complex data structures?
by djantzen (Priest) on Jul 12, 2002 at 14:50 UTC

    Why do you feel that this needs to be predefined? Or a better question is, how would you go about doing it given that the structure in its nature is open ended? That is to say, the array reference pointed to by the "requirements" key has no built-in limit, nor does there appear to be a maximum number of attributes in the embedded hash. Thus, if you were to predefine it in some way, you'd have to choose an arbitrary depth to which to do so.

    As far as push()ing and shift()ing and what not is concerned, this should do it:

    push(@{$tagstack{requirements}}, { text => 'foo', attributes => [ 'whatever' ] } );
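The same dereferencing syntax works for the other list operators too. A minimal sketch, with placeholder text values:

```perl
use strict;
use warnings;

my %tagstack;

# @{ ... } turns the stored reference back into an array, so the
# usual array builtins all work on it.
push    @{ $tagstack{requirements} }, { text => 'second' };
push    @{ $tagstack{requirements} }, { text => 'third' };
unshift @{ $tagstack{requirements} }, { text => 'first' };

my $last  = pop   @{ $tagstack{requirements} };   # removes 'third'
my $first = shift @{ $tagstack{requirements} };   # removes 'first'

print "$first->{text} $last->{text} ",
    scalar @{ $tagstack{requirements} }, "\n";
```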

      What I had meant by predefining is just to predefine the structure, not the data itself. If I know the structure of the data, defining an appropriate perl structure to hold it shouldn't be that hard.

      I was having difficulty with the push because it kept trying to push into a hash (d'oh!) and I didn't know the correct syntax to fix it. Thanks!

(dkubb) Re: (2) XML parsing and SAX event handlers
by dkubb (Deacon) on Jul 13, 2002 at 10:13 UTC

    Many of the approaches in this thread centered around using XML::Simple. Why not try using XML::SAX and build your own SAX event handler? I believe it can satisfy your requirements while at the same time providing more flexibility than XML::Parser's interface.

    A good introduction to creating SAX event handlers can be found at XML::SAX::Intro in the XML::SAX distribution on CPAN.

    To address your question, here's a working example:

    #!/usr/bin/perl -wT
    use strict;
    use XML::SAX;
    use Data::Dumper qw(DumperX);

    my $handler = My::SAXParser->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

    #pass the XML document at the bottom __DATA__ tag to the parser
    $parser->parse_string(do { local $/; <DATA> });

    print DumperX($handler->nodes);

    {
        #this class keeps track of the processed nodes
        package My::SAXParser;

        use strict;
        use base qw(XML::SAX::Base);
        use Class::MethodMaker
            get_set => ['nodes'],
            list    => ['element_stack'];

        use constant SKIP_NODE => 'xml';

        sub start_document { shift->nodes({}) }

        sub start_element {
            my $self = shift;
            my $el   = shift;

            return if $el->{Name} eq SKIP_NODE;

            #make note of which element we are processing - in the stack
            $self->element_stack_push(\my %element);

            foreach my $attribute (values %{$el->{Attributes}}) {
                push @{$element{attributes}}, @$attribute{qw(Name Value)};
            }

            #keep track of all interesting element nodes
            push @{ $self->nodes->{$el->{Name}} }, \%element;

            return $self->SUPER::start_element($el);
        }

        sub characters {
            my $self = shift;

            return unless $self->element_stack_count; #are there any pending element nodes to process?

            return $self->SUPER::characters($self->element_stack->[-1]->{text} .= shift->{Data});
        }

        sub end_element {
            my $self = shift;

            $self->element_stack_pop; #element has been processed, pop it off the stack

            return $self->SUPER::end_element(shift);
        }
    }

    __DATA__
    <xml>
    <requirement contactname="Joe Average">A power cord.</requirement>
    <requirement contactname="Jane Smith" contactnumber="555-1212">A node name</requirement>
    </xml>

    This should produce the following output:

    $VAR1 = {
      'requirement' => [
        {
          'text' => 'A power cord.',
          'attributes' => [
            'contactname',
            'Joe Average'
          ]
        },
        {
          'text' => 'A node name',
          'attributes' => [
            'contactnumber',
            '555-1212',
            'contactname',
            'Jane Smith'
          ]
        }
      ]
    };

    I tested this code with the other XML document example you posted in this thread. It can parse it and I believe it produces a pretty reasonable output.

    Also if performance is an issue it's possible to gain further speed increases using XML::LibXML::SAX::Parser or XML::SAX::Expat. Either of these modules can pretty much just be dropped into the above script by modifying two lines of the script's code: the use and new constructor statements.

      Either of these modules can pretty much just be dropped into the above script by modifying two lines of the script's code

      Actually, it shouldn't be necessary to modify the code at all. Your sample code uses XML::SAX::ParserFactory which will use the system default SAX parser (as defined in lib/XML/SAX/ParserDetails.ini). So if you install XML::SAX::Expat, your script will immediately make use of it.

      I found the SAX documentation rather confusing the first time I read it over so I put it down for a while. Now I've picked it back up and with a little help from O'Reilly's Perl and XML I'm recoding into XML::SAX.

      On a related note, I highly recommend O'Reilly's Safari service. Online books! It's very cool.