tevolo has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have an XML file that I want to parse, storing information from some of its branches in arrays. I have tried several different approaches and none seems to do what I need. It is the url attribute that I wish to store in separate arrays.

Thanks in advance for any insight.

use LWP::Simple;
use XML::Simple;
use Data::Dumper;
use strict;
use warnings;

my $parser = XMLin("c:\\temp\\data.xml");
print Dumper($parser);
print $parser->{HttpSiteAddress} . "\n";
print $parser->{SiteAssets}->{asset}->{url}, "\n";
print $parser->{FragmentAssets}->{asset}->{url}->[1];

Produces the following output + error:

$VAR1 = {
  'PublishInfo' => {
    'nodeUrl' => [
      {
        'nodeId' => '27',
        'url' => 'index.htm'
      },
      {
        'nodeId' => '256',
        'url' => 'foia/index.htm'
      },
      {
        'nodeId' => '259',
        'url' => 'foia/2006/index.htm'
      }
    ]
  },
  'siteStudioVersion' => '7.7.0.1',
  'SiteAssets' => {
    'asset' => {
      'url' => '/ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg',
      'dDocName' => 'pgHdr_left'
    }
  },
  'getPageUrl' => '/ucm/idcplg?IdcService=SS_GET_PAGE&',
  'siteStudioBuild' => '9.0.0.506',
  'FragmentAssets' => {
    'asset' => [
      {
        'url' => '/ucm/fragments/web_header/css/nav_ie6fix.css'
      },
      {
        'url' => '/ucm/fragments/css4footer/img/promotions/transparency.png'
      },
      {
        'url' => '/ucm/fragments/css4footer/img/bg-box-menu.gif'
      }
    ]
  },
  'format' => '1.1',
  'HttpRelativeWebRoot' => '/ucm/',
  'siteId' => 'CFTC',
  'DeclaredAssets' => {},
  'HttpSiteAddress' => 'http://dc2kd11.cftc.gov:7070/'
};
http://dc2kd11.cftc.gov:7070/
/ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg
Not a HASH reference at C:\apache\apache2.2\cache_proxy\cgi-bin\test3.pl line 12.

The input file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<SiteStudioManifest
    format="1.1"
    siteStudioVersion="7.7.0.1"
    siteStudioBuild="9.0.0.506"
    siteId="CFTC"
    HttpSiteAddress="http://dc2kd11.cftc.gov:7070/"
    HttpRelativeWebRoot="/ucm/"
    getPageUrl="/ucm/idcplg?IdcService=SS_GET_PAGE&amp;"
>
  <FragmentAssets>
    <asset url="/ucm/fragments/web_header/css/nav_ie6fix.css"/>
    <asset url="/ucm/fragments/css4footer/img/promotions/transparency.png"/>
    <asset url="/ucm/fragments/css4footer/img/bg-box-menu.gif"/>
  </FragmentAssets>
  <PublishInfo>
    <nodeUrl nodeId="27" url="index.htm"/>
    <nodeUrl nodeId="256" url="foia/index.htm"/>
    <nodeUrl nodeId="259" url="foia/2006/index.htm"/>
  </PublishInfo>
  <DeclaredAssets>
  </DeclaredAssets>
  <SiteAssets>
    <asset dDocName="pgHdr_left" url="/ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg"/>
  </SiteAssets>
</SiteStudioManifest>

Replies are listed 'Best First'.
Re: Problem parsing XML into arrays using XML::Simple
by richb (Scribe) on Jan 28, 2011 at 15:11 UTC

    I would tackle that with XML::LibXML and use XPath to access the data you want. Here's how to grab the URLs out of the FragmentAssets and SiteAssets nodes.

    use strict;
    use warnings;
    use XML::LibXML;
    use Data::Dumper;

    my $parser = XML::LibXML->new({"encoding" => "utf-8"});
    my $doc = $parser->parse_file("pm884843.xml");

    my @paths = (
        "/SiteStudioManifest/FragmentAssets/asset",
        "/SiteStudioManifest/SiteAssets/asset"
    );

    for my $path (@paths) {
        my @url = map { $_->getAttributeNode("url")->getValue() } $doc->findnodes($path);
        print "URLs for $path:\n";
        print "\t$_\n" for (@url);
        print "\n";
    }

    Output:
    C:\autoworking\perl>perl pm884843.pl
    URLs for /SiteStudioManifest/FragmentAssets/asset:
            /ucm/fragments/web_header/css/nav_ie6fix.css
            /ucm/fragments/css4footer/img/promotions/transparency.png
            /ucm/fragments/css4footer/img/bg-box-menu.gif

    URLs for /SiteStudioManifest/SiteAssets/asset:
            /ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg

    You could grab the HttpSiteAddress attribute value from the root SiteStudioManifest node in a similar fashion.
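
    A minimal sketch of that last step, assuming a cut-down copy of the manifest inlined as a string (the real script would keep using parse_file). The variable names are illustrative only; it shows two equivalent ways to read the attribute, one through the root element and one through an XPath expression.

    ```perl
    use strict;
    use warnings;
    use XML::LibXML;

    # Cut-down copy of the manifest from the question, inlined so the sketch runs on its own.
    my $xml = q{<?xml version="1.0" encoding="UTF-8"?>
    <SiteStudioManifest siteId="CFTC" HttpSiteAddress="http://dc2kd11.cftc.gov:7070/">
      <FragmentAssets>
        <asset url="/ucm/fragments/web_header/css/nav_ie6fix.css"/>
      </FragmentAssets>
    </SiteStudioManifest>};

    my $doc = XML::LibXML->load_xml(string => $xml);

    # Option 1: ask the root element for its attribute directly.
    my $site_address = $doc->documentElement->getAttribute('HttpSiteAddress');

    # Option 2: the same value through XPath; '@' selects an attribute.
    my $via_xpath = $doc->findvalue('/SiteStudioManifest/@HttpSiteAddress');

    print "$site_address\n";
    print "$via_xpath\n";
    ```

    The XPath form fits naturally next to the findnodes() calls above, since all the paths can then live in one list.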

      I very strongly agree with this approach. You see, you actually have a very well-known and universal problem: you need to fetch particular well-known things out of an XML data structure. You want to be able to do that generically, without creating (and having to constantly maintain) complex procedural code to do so. “XPath expressions” are a perfect way to do that.

      So, say goodbye and good day (in this particular case) to XML::Simple. Time to call in his big brother.

      In fact, another technology that you might wish to look into is “XML style sheets (XSL),” which allow you to specify complete XML extractions and transformations, non-procedurally. While the technology is, shall we say, “somewhat obfuscatory,” it is nonetheless very powerful and therefore worth study. I need not bother to say that Perl has excellent support for it. Of course it does... of course...

      Wow, thanks for that code. As is always the case, minutes after posting to this site I came across a solution that works, but I will also try your code to see which works best. Thanks again.

      This is what I found to work:

      use LWP::Simple;
      use XML::Simple;
      use Data::Dumper;
      use strict;
      use warnings;

      my $xml = XML::Simple->new(KeyAttr => []);
      my $parser = $xml->XMLin("c:\\temp\\data.xml");
      print Dumper($parser);

      foreach my $e (@{$parser->{FragmentAssets}->{asset}}) {
          print $e->{url}, "\n";
          print "\n";
      }

Re: Problem parsing XML into arrays using XML::Simple
by dasgar (Priest) on Jan 28, 2011 at 15:01 UTC

    I just skimmed over your post and didn't look at your code and data very closely. (I apologize if I misread the problem in my haste.)

    However, it sounds to me like you'd be interested in checking out the ForceArray option for the XML::Simple module. You can force everything into arrays or only certain elements that you specify.
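
    A rough sketch of what that could look like for the manifest in the question, assuming a cut-down copy of the XML inlined as a string. ForceArray and KeyAttr are documented XML::Simple options; the variable names are made up for illustration.

    ```perl
    use strict;
    use warnings;
    use XML::Simple;

    # Cut-down manifest: FragmentAssets has two <asset> elements, SiteAssets only one.
    my $xml = q{<SiteStudioManifest siteId="CFTC">
      <FragmentAssets>
        <asset url="/ucm/fragments/web_header/css/nav_ie6fix.css"/>
        <asset url="/ucm/fragments/css4footer/img/bg-box-menu.gif"/>
      </FragmentAssets>
      <SiteAssets>
        <asset dDocName="pgHdr_left" url="/ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg"/>
      </SiteAssets>
    </SiteStudioManifest>};

    # ForceArray => ['asset'] turns every <asset> into an array element, even a lone one.
    # KeyAttr => [] switches off folding of arrays into hashes keyed on name/key/id.
    my $data = XMLin($xml, ForceArray => ['asset'], KeyAttr => []);

    # Both branches can now be walked the same way.
    my @fragment_urls = map { $_->{url} } @{ $data->{FragmentAssets}{asset} };
    my @site_urls     = map { $_->{url} } @{ $data->{SiteAssets}{asset} };

    print "$_\n" for @fragment_urls, @site_urls;
    ```

    Naming only 'asset' keeps the rest of the structure as plain hashes; ForceArray => 1 would wrap every nested element in an array instead.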

Re: Problem parsing XML into arrays using XML::Simple
by jethro (Monsignor) on Jan 28, 2011 at 14:53 UTC

    The Dumper output suggests that

    print $parser->{FragmentAssets}->{asset}->{url}->[1];

    should be

    print $parser->{FragmentAssets}->{asset}->[1]->{url};
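
    To see why the order matters, here is a small sketch that needs no XML module at all: it uses a hand-built copy of the FragmentAssets branch from the Dumper output in the question. The names $second_url and @urls are illustrative.

    ```perl
    use strict;
    use warnings;

    # Hand-built copy of the FragmentAssets branch shown in the Dumper output.
    my $parser = {
        FragmentAssets => {
            asset => [
                { url => '/ucm/fragments/web_header/css/nav_ie6fix.css' },
                { url => '/ucm/fragments/css4footer/img/promotions/transparency.png' },
                { url => '/ucm/fragments/css4footer/img/bg-box-menu.gif' },
            ],
        },
    };

    # 'asset' holds an array reference, so index into it first, then ask for 'url'.
    my $second_url = $parser->{FragmentAssets}->{asset}->[1]->{url};
    print "$second_url\n";

    # Collecting every url into a plain array follows the same shape.
    my @urls = map { $_->{url} } @{ $parser->{FragmentAssets}->{asset} };
    print scalar(@urls), " urls\n";
    ```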
Re: Problem parsing XML into arrays using XML::Simple
by Anonyrnous Monk (Hermit) on Jan 28, 2011 at 15:02 UTC

    Also, in case you don't know in advance whether there will be one or more entries for "asset", you might want to look into the ForceArray option.

    This might ease accessing elements, as you then wouldn't have to test first whether it's an array...
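
    For comparison, this is roughly the test you would otherwise need. It is only a sketch: as_list is a hypothetical helper, not part of XML::Simple, and the two sample structures are copied by hand from the Dumper output in the question.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns a list of asset hashes whether the parsed
    # value is a single hash, an array of hashes, or missing altogether.
    sub as_list {
        my ($node) = @_;
        return ()     unless defined $node;
        return @$node if ref $node eq 'ARRAY';
        return ($node);
    }

    # Without ForceArray, a lone <asset> arrives as a hash ...
    my $single = {
        url      => '/ucm/groups/public/@websitestructure/documents/file/pghdr_left.jpg',
        dDocName => 'pgHdr_left',
    };

    # ... while repeated <asset> elements arrive as an array of hashes.
    my $several = [
        { url => '/ucm/fragments/web_header/css/nav_ie6fix.css' },
        { url => '/ucm/fragments/css4footer/img/bg-box-menu.gif' },
    ];

    my @single_urls  = map { $_->{url} } as_list($single);
    my @several_urls = map { $_->{url} } as_list($several);
    my @no_urls      = map { $_->{url} } as_list(undef);

    print scalar(@single_urls), " / ", scalar(@several_urls), " / ", scalar(@no_urls), "\n";
    ```

    With ForceArray set for "asset", the helper becomes unnecessary because the value is an array reference whenever the element is present.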