MarkovChain has asked for the wisdom of the Perl Monks concerning the following question:

Hello Venerable Monks,

I know this topic has been brought up before and I am probably being criminal here by being repetitious. But I have spent a bit of time on my code here and I fail to see my oversight... I would appreciate it if someone could help me out here.

I have an xml file that I am trying to parse via XPath expressions. I had initially thought about using XML::Twig but since my parsing is never going to be huge, I am sticking with XML::LibXML; especially because I need to be validating my input with a local schema prior to the actual parsing.

The xml file is:

<?xml version="1.0" encoding="UTF-8"?>
<instruction_request xmlns="http://www.somedomain.tld/market_reg/admin_server/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.somedomain.tld/market_reg/admin_server/1.0 admin_server.xsd">
    <request>
        <action_request>
            <action>START</action>
            <instance_information>
                <application_type> SomeApp </application_type>
                <application_name> app_name </application_name>
                <environment>QC</environment>
                <application_host_ip>10.10.24.56</application_host_ip>
                <application_host_name> somehost.domain_name.tld </application_host_name>
            </instance_information>
        </action_request>
    </request>
    <request>
        <action_request>
            <action>STOP</action>
            <instance_information>
                <application_type> SomeApp2 </application_type>
                <application_name> instance_name </application_name>
                <environment>QC</environment>
                <application_host_ip>172.16.24.56</application_host_ip>
                <application_host_name> somehost.domain_name.tld </application_host_name>
            </instance_information>
        </action_request>
    </request>
    <request>
        <info_request>
            <action>GET_INFO</action>
        </info_request>
    </request>
</instruction_request>

Now my parsing code is:

use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;
use XML::LibXML::XPathContext;

############################################################
########           VARIABLE DECLARATION             ########
############################################################
my $file = '/Volumes/UserData/Users/dattanik/Programs/XML/test_xml.xml';

############################################################
########               MAIN PROGRAM                 ########
############################################################
#----------------------------------------------------------
# Create a new parser.
#----------------------------------------------------------
my $parser = XML::LibXML->new();

#----------------------------------------------------------
# Parse the XML document.
#----------------------------------------------------------
my $dom = $parser->load_xml(location => $file);
#print ref($dom), "\n";    # XML::LibXML::Document

#----------------------------------------------------------
# Get the document root.
#----------------------------------------------------------
my $root = $dom->getDocumentElement();
#print ref($root), "\n";   # XML::LibXML::Element

#----------------------------------------------------------
# Create the XPath context.
# (Not shown in the original post; assumed here so that
# $xpc is declared under "use strict".)
#----------------------------------------------------------
my $xpc = XML::LibXML::XPathContext->new($root);

#----------------------------------------------------------
# Set the context (current) node.
#----------------------------------------------------------
$xpc->setContextNode($root);
print "Context node set as |" . $xpc->getContextNode->nodeName() . "|\n";

#----------------------------------------------------------
# Make sure you can access data.
#----------------------------------------------------------
my $action_requests = $xpc->find('//*', $root);
print ref($action_requests) . "\n";
print 'Size of action requests is ' . $action_requests->size() . "\n";
print 'String value of action requests is "' . $action_requests->string_value() . "\"\n";
print 'To literal of action requests is "' . $action_requests->string_value() . "\"\n";
print 'To literal of action requests is "' . $action_requests->get_node(4) . "\"\n";
print 'To element name is "' . $action_requests->get_node(4)->nodeName() . "\"\n";
print 'To node value is "' . $action_requests->get_node(4)->textContent() . "\"\n";

#----------------------------------------------------------
# Get all action requests.
# =====> THIS IS WHERE IT GETS FUNKY <=========
#----------------------------------------------------------
print '-' x 80 . "\n";
print 'Trying to get the requests directly with an XPath expression ...' . "\n";
print '-' x 80 . "\n";

# Reuse the variable declared above instead of re-declaring it with "my".
$action_requests = $xpc->find('//request');
print ref($action_requests) . "\n";
print 'Size of action requests is ' . $action_requests->size() . "\n";
print 'String value of action requests is "' . $action_requests->string_value() . "\"\n";
print 'To literal of action requests is "' . $action_requests->string_value() . "\"\n";
print 'To literal of action requests is "' . $action_requests->get_node(4) . "\"\n";
print 'To element name is "' . $action_requests->get_node(4)->nodeName() . "\"\n";
print 'To node value is "' . $action_requests->get_node(4)->textContent() . "\"\n";

The output is:

$ perl newxmltest.pl
request: action_request: action: instance_information: application_type: application_name: environment: application_host_ip: application_host_name: request: action_request: action: instance_information: application_type: application_name: environment: application_host_ip: application_host_name: request: info_request: action: XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
XML::LibXML::Element
Context node set as |instruction_request|
XML::LibXML::NodeList
Size of action requests is 22
String value of action requests is " START SomeApp app_name QC 10.10.24.56 somehost.domain_name.tld STOP SomeApp2 instance_name QC 172.16.24.56 somehost.domain_name.tld GET_INFO "
To literal of action requests is " START SomeApp app_name QC 10.10.24.56 somehost.domain_name.tld STOP SomeApp2 instance_name QC 172.16.24.56 somehost.domain_name.tld GET_INFO "
To literal of action requests is "XML::LibXML::Element=SCALAR(0x81c770)"
To element name is "action"
To node value is "START"
-----------------------------------------------------------
Trying to get the requests directly with an XPath expression ...
-----------------------------------------------------------
XML::LibXML::NodeList
Size of action requests is 0
String value of action requests is ""
To literal of action requests is ""
Use of uninitialized value in concatenation (.) or string at newxmltest.pl line 127.
To literal of action requests is ""
Can't call method "nodeName" on an undefined value at newxmltest.pl line 129.

NOTE: THE OUTPUT HAS BEEN EDITED FROM A FORMATTING PERSPECTIVE.

I am unable to use XPath expressions such as //request in find / findnodes calls. I have seen examples around the net that claim XPath expressions work, but they have not worked for me. I am using the latest version of XML::LibXML (version 1.70).

Thank You!!

Replies are listed 'Best First'.
Re: XML::LibXML - parsing question!!
by ikegami (Patriarch) on Dec 18, 2009 at 21:28 UTC

    You're asking to match "request" elements in the null namespace, but the closest your XML contains are "request" elements in the "http://www.somedomain.tld/market_reg/admin_server/1.0" namespace.

    What you do is create a prefix, associate it with that namespace, and use that prefix in the XPath.

    use strict;
    use warnings;

    use XML::LibXML               qw( );
    use XML::LibXML::XPathContext qw( );

    ( my $file = $0 ) =~ s/\.pl\z/.xml/i;

    my $parser = XML::LibXML->new();
    my $dom    = $parser->load_xml( location => $file );
    my $root   = $dom->getDocumentElement();

    my $xpc = XML::LibXML::XPathContext->new($root);
    $xpc->registerNs( p => 'http://www.somedomain.tld/market_reg/admin_server/1.0' );

    for my $node ( $xpc->findnodes('//p:request/*/p:action') ) {
        print $node->textContent(), "\n";
    }

    START
    STOP
    GET_INFO

    Note that "p" is arbitrary. You can use something more meaningful. Or not.

    By the way, you never have to call setContextNode. You can simply pass the context node as the second argument to find* as the following demonstrates:

    use strict;
    use warnings;

    use XML::LibXML               qw( );
    use XML::LibXML::XPathContext qw( );

    ( my $file = $0 ) =~ s/\.pl\z/.xml/i;

    my $parser = XML::LibXML->new();
    my $dom    = $parser->load_xml( location => $file );
    my $root   = $dom->getDocumentElement();

    my $xpc = XML::LibXML::XPathContext->new($root);
    $xpc->registerNs( p => 'http://www.somedomain.tld/market_reg/admin_server/1.0' );

    for my $req_node ( $xpc->findnodes('//p:request') ) {
        for my $action_node ( $xpc->findnodes('*/p:action', $req_node) ) {
            print $action_node->textContent(), "\n";
        }
    }

      Ahh my friend Ikegami...

      I had tried with the namespace ... will try it again...

      Yay it worked.... I realized my mistake... although I had registered the namespace, I was only associating the namespace with the first element and naively assumed that it would get implicitly associated with child / siblings.

      Hats off Sir!!

        I was only associating the namespace with the first element

        Do you mean you "only specified a prefix for the first node test in the XPath"? There's no way to not specify a namespace for a node test in an XPath, so there's no opportunity to default to anything.
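To illustrate the point about node tests, here is a minimal sketch (not code from the thread). The small document and the `urn:example:admin_server` namespace URI are made-up stand-ins for the real file. It compares an XPath that prefixes only the first step with one that prefixes every named step; the unprefixed `action` step is looked up in the null namespace and so matches nothing.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Stand-in document with a default namespace (hypothetical URI).
my $xml = <<'XML';
<instruction_request xmlns="urn:example:admin_server">
  <request><action_request><action>START</action></action_request></request>
  <request><info_request><action>GET_INFO</action></info_request></request>
</instruction_request>
XML

my $dom = XML::LibXML->load_xml( string => $xml );
my $xpc = XML::LibXML::XPathContext->new( $dom->documentElement );
$xpc->registerNs( p => 'urn:example:admin_server' );

# Prefix on the first step only: 'action' is tested against the null namespace.
my @partial = $xpc->findnodes('//p:request/*/action');

# Prefix on every named step.
my @full = $xpc->findnodes('//p:request/*/p:action');

printf "prefix on first step only: %d node(s)\n", scalar @partial;
printf "prefix on every step:      %d node(s)\n", scalar @full;
print $_->textContent, "\n" for @full;
```

The `*` step needs no prefix because a wildcard matches elements in any namespace; only named steps have to carry the registered prefix.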

        Hmm I've been trying to validate the xml with my schema and my schema validation works perfectly in my xml editor (I'm using Oxygen XML Editor... not that it matters).

        The validation fails when using XML::LibXML::Schema.

        #----------------------------------------------------------
        # Validate against XML schema.
        #----------------------------------------------------------
        my $xml_schema = XML::LibXML::Schema->new(
            location => '/Users/dattanik/Programs/XML/admin_server.xsd'
        );

        eval { $xml_schema->validate($dom); };

        if ($@) {
            print "Looks like schema validation has failed. \n";
            print "$@", "\n";
        }
        else {
            print "XML Schema validation succeeded.\n";
        }

        I have a primitive facet for validating IP addresses...(\d{1,3}\.){3}\d{1,3} that's embedded in my XML Schema. I am also collapsing the embedded spaces in the application_host_ip element.

        Nothing special here, but that is where it fails when validating through XML::LibXML::Schema... whereas validating the exact same XML document against the same schema succeeds in my XML editor. My relevant schema code is:

        <xs:element name="application_host_ip">
            <xs:annotation>
                <xs:documentation>The IP address of the host on which the application is executing.</xs:documentation>
            </xs:annotation>
            <xs:simpleType>
                <xs:restriction base="xs:string">
                    <xs:pattern value="(\d{1,3}\.){3}\d{1,3}">
                        <xs:annotation>
                            <xs:documentation>This regular expression does a rudimentary check on the IP address.</xs:documentation>
                        </xs:annotation>
                    </xs:pattern>
                    <xs:maxLength value="15"/>
                    <xs:minLength value="7"/>
                    <xs:whiteSpace value="collapse"/>
                </xs:restriction>
            </xs:simpleType>
        </xs:element>

        Now I checked the support docs in XML::LibXML::Schema but it does not seem to provide any additional info besides the class interface.

        The relevant output from STDOUT is printed...

        Looks like schema validation has failed
        Element 'application_host_ip' [ST local, facet 'pattern']: The value '123.123.24.56' is not accepted by the pattern '(\d{1,3}\.){3}\d{1,3}'.
        Element 'application_host_ip' [ST local, facet 'pattern']: The value '123.123.24.56' is not accepted by the pattern '(\d{1,3}\.){3}\d{1,3}'.

        I am doing all this as part of my pet project to monitor a myriad of applications running across my production network. These guys have log4j writers and currently the maintenance folks are individually logging into those boxes via ssh and running standard maintenance procedures. I am automating this process by writing a POE server that listens on these log4j and presents status info to a web client. The web user agent can then send requests in standard XML format that is interpreted by the server and it does the maintenance tasks... much more reliable. This is the last part :)....

Re: XML::LibXML - parsing question!!
by Jenda (Abbot) on Dec 21, 2009 at 11:38 UTC
    I am sticking with XML::LibXML; especially because I need to be validating my input with a local schema prior to the actual parsing.

    Then validate the input prior to actual parsing and do not let the choice of validator affect your choice of parser (extractor).
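A minimal sketch of keeping validation as its own step, assuming XML::LibXML::Schema does the validating. The one-element schema, the `urn:example:demo` namespace, and the `validation_error` helper are all hypothetical and exist only for illustration. Once this step reports no error, the same input can be handed to whichever extractor is preferred.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical one-element schema, for illustration only.
my $xsd = <<'XSD';
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:demo"
           elementFormDefault="qualified">
  <xs:element name="action" type="xs:string"/>
</xs:schema>
XSD

my $schema = XML::LibXML::Schema->new( string => $xsd );

# Validation step (hypothetical helper): returns an empty string when the
# input is well-formed and valid, otherwise the error text.
sub validation_error {
    my ($xml_string) = @_;

    my $doc = eval { XML::LibXML->load_xml( string => $xml_string ) };
    return "not well-formed: $@" if !$doc;

    # validate() dies with the validation messages on failure.
    my $ok = eval { $schema->validate($doc); 1 };
    return $ok ? '' : "$@";
}

for my $input (
    '<action xmlns="urn:example:demo">START</action>',
    '<bogus xmlns="urn:example:demo"/>',
) {
    my $error = validation_error($input);
    print $error eq '' ? "valid:   $input\n" : "invalid: $input\n";
}
```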

    If you want to do something with the action_request/info_request right away:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        namespaces => {
            'http://www.somedomain.tld/market_reg/admin_server/1.0' => '',
            'http://www.w3.org/2001/XMLSchema-instance' => 'xsi',
        },
        rules => {
            _default => 'content',
            instance_information => 'as is',
            'action_request,info_request' => sub {
                my ($tag,$attr) = @_;
                print $attr->{action}, "\n";
                while ( my ($k,$v) = each %{$attr->{instance_information}}) {
                    print " $k: $v\n";
                }
                print "\n";
                return;
            },
        },
    );

    $parser->parse(\*DATA);

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <instruction_request xmlns="http://www.somedomain.tld/market_reg/admin_server/1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.somedomain.tld/market_reg/admin_server/1.0 admin_server.xsd">
    <request>
    ...

    If you want to extract the data and just the data:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        namespaces => {
            'http://www.somedomain.tld/market_reg/admin_server/1.0' => '',
            'http://www.w3.org/2001/XMLSchema-instance' => 'xsi',
        },
        rules => {
            _default => 'content',
            instance_information => 'pass',
            'action_request,info_request' => 'pass',
            'request' => 'as array',
            'instruction_request' => sub {$_[1]->{request}},
        },
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);

    __DATA__
    ...

    If you need to distinguish between action and info requests:

    ...
    'action_request,info_request' => sub { my ($tag, $attr) = @_; $attr->{type} = $tag; return %{$attr} },
    ...

    In the first case only the data of one <request> is in memory at any time; in the others all of the data ends up in memory, but trimmed down substantially.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Hi Jenda,

      Thanks for the feedback.

      The idea of validating before parsing is indeed what I intend to do in my final project. I was just doing a dry run in a test script to get comfortable with the modules before making changes to my code branch.

      That being said, good thoughts!! I will take a look at the XML::Rules module. I am currently upgrading libxml2 on my Mac. It comes with 2.6.16 installed and the latest is 2.7*... In any case, I have a pretty rudimentary regex and it should pass. It passes in my XML editor when it validates against the said XML Schema.
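The claim that the pattern should accept the value can be sanity-checked outside libxml2. The following is a rough sketch in plain Perl, not a reproduction of the schema validator: Perl's regex dialect differs from the XSD one, although this pattern only uses constructs common to both. The `collapse_ws` helper is hypothetical and only approximates the whiteSpace="collapse" facet; the minLength/maxLength facets are not modelled.

```perl
use strict;
use warnings;

# Approximate whiteSpace="collapse": trim both ends and squeeze internal
# runs of whitespace to a single space (hypothetical helper).
sub collapse_ws {
    my ($s) = @_;
    $s =~ s/\A\s+|\s+\z//g;
    $s =~ s/\s+/ /g;
    return $s;
}

# An XSD pattern has to match the whole value, hence the \A ... \z anchors.
my $ip_pattern = qr/\A(\d{1,3}\.){3}\d{1,3}\z/;

my %accepted;
for my $raw ( '123.123.24.56', ' 10.10.24.56 ', '1.2.3', '1234.1.1.1' ) {
    my $value = collapse_ws($raw);
    $accepted{$raw} = ( $value =~ $ip_pattern ) ? 1 : 0;
    printf "%-16s => %s\n", "'$raw'", $accepted{$raw} ? 'accepted' : 'rejected';
}
```

If the value from the error message is accepted here, that only shows the regular expression itself is not the obvious culprit; it says nothing about how a particular libxml2 release evaluates the facet.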