the_r has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'm pretty new to Perl. I have an XML file that contains several different XML messages. What the messages have in common is a child node with the same name (EventInfo). The following is an example of the payloads:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryTimeChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313444" CreationDatetime="2017/02/09 07:59:17 369 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.TIME.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryChangeOperationType OperationTypeCode="DELAY" OperationSubtypeCode="HOLD" DeliveryChangeReason="Weather" DeliveryDate="20170210"/> </DeliveryTimeChanged>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryRouteChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313445" CreationDatetime="2017/02/09 07:59:23 639 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.ROUTE.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryRouteType OperationTypeCode="AIR" OperationSubtypeCode="HOLD" DeliveryDate="20170210"/> </DeliveryRouteChanged>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryCanceled CurrentStatus="Canceled" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313446" CreationDatetime="2017/02/09 07:59:44 963 GMT" RequestId="421150444"> <TopicCounts TopicName="DELIVERY.STATUS.CANCELED" TopicCount="1"/> </EventInfo> <DeliveryStatusType DeliveryStatusCode="CX" OperationSubtypeCode="CANCELED" DeliveryDate="20170210"/> </DeliveryCanceled>

What I would like is to pull the entire XML message that has a certain RequestId attribute value (321150454) in the EventInfo node, regardless of what the parent node is.

I have tried the following perl script:

perl -ne ' if(/EventInfo>/){$p=0} if(/RequestId="321150454"/) {print $ARGV; print " "; print; $p=1;next}print if$p' sample.xml

The output is only giving me the EventInfo node:

<EventInfo EventId="666313444" CreationDatetime="2017/02/09 07:59:17 369 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.TIME.CHANGED" TopicCount="1"/>

sample.xml <EventInfo EventId="666313445" CreationDatetime="2017/02/09 07:59:23 639 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.ROUTE.CHANGED" TopicCount="1"/>

What I would like is the entire xml payload of these two records:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryTimeChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313444" CreationDatetime="2017/02/09 07:59:17 369 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.TIME.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryChangeOperationType OperationTypeCode="DELAY" OperationSubtypeCode="HOLD" DeliveryChangeReason="Weather" DeliveryDate="20170210"/> </DeliveryTimeChanged>

AND

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryRouteChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313445" CreationDatetime="2017/02/09 07:59:23 639 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.ROUTE.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryRouteType OperationTypeCode="AIR" OperationSubtypeCode="HOLD" DeliveryDate="20170210"/> </DeliveryRouteChanged>

How do I get the entire xml payload (parent and child nodes)? Any help with this would be greatly appreciated. Thanks for your time.

Re: Retrieving XML From a File Based On Child Node Attribute
by Anonymous Monk on Feb 10, 2017 at 21:01 UTC
    Do not use regular expressions to parse XML.
    use XML::LibXML;

    my $ReqId = '321150454';
    my $doc   = XML::LibXML->load_xml(string=>$xml);
    my @nodes = $doc->findnodes("/*/EventInfo[\@RequestId='$ReqId']");
    for my $node (@nodes) {
        print "### ", $node->getParentNode->toString, " ###\n\n";
    }

      Thanks for the reply. The reason I used regular expressions was that the actual XML is contained within a log file that contains other information besides XML. Will XML::LibXML handle any type of file, or does it strictly need an XML file?

      I tried the following script and am getting a parser error: Start tag expected, '<' not found. Below is the code:

      #!/usr/bin/perl
      use XML::LibXML;

      my $requestId = $ARGV[0];
      my $fileName  = "sample.xml";
      print "$requestId\n";
      print "$fileName\n";

      my $doc   = XML::LibXML->load_xml(string=>$fileName);
      my @nodes = $doc->findnodes("/*/EventInfo[\@RequestId='$requestId']");
      for my $node (@nodes) {
          print "### ", $node->getParentNode->toString, " ###\n\n";
      }

      The sample xml file is as follows:

      <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryTimeChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313444" CreationDatetime="2017/02/09 07:59:17 369 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.TIME.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryChangeOperationType OperationTypeCode="DELAY" OperationSubtypeCode="HOLD" DeliveryChangeReason="Weather" DeliveryDate="20170210"/> </DeliveryTimeChanged>

      <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryRouteChanged CurrentStatus="OnHold" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313445" CreationDatetime="2017/02/09 07:59:23 639 GMT" RequestId="321150454"> <TopicCounts TopicName="DELIVERY.ROUTE.CHANGED" TopicCount="1"/> </EventInfo> <DeliveryRouteType OperationTypeCode="AIR" OperationSubtypeCode="HOLD" DeliveryDate="20170210"/> </DeliveryRouteChanged>

      <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <DeliveryCanceled CurrentStatus="Canceled" xmlns:ns2="http://com/post/orderupdatesasync/jaxbxml"> <EventInfo EventId="666313446" CreationDatetime="2017/02/09 07:59:44 963 GMT" RequestId="421150444"> <TopicCounts TopicName="DELIVERY.STATUS.CANCELED" TopicCount="1"/> </EventInfo> <DeliveryStatusType DeliveryStatusCode="CX" OperationSubtypeCode="CANCELED" DeliveryDate="20170210"/> </DeliveryCanceled>

        Hi the_r,

        First, have a look at the documentation, and note that XML::LibXML->load_xml(string=>$fileName); is trying to parse the string contained in $fileName, that is, the text "sample.xml" itself, which is why the parser complains that no start tag was found. What you want is XML::LibXML->load_xml(location=>$fileName); instead.
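For a file that holds exactly one well-formed XML document, the corrected call could be sketched as follows. The sub name `message_from_file` and the two-argument command line are hypothetical, chosen only for illustration; the RequestId is interpolated into the XPath as-is, so this assumes it contains no quote characters.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the serialized parent element of every
# EventInfo node in $fileName whose RequestId attribute matches.
sub message_from_file {
    my ($fileName, $requestId) = @_;

    # location => reads and parses the named file;
    # string => would try to parse the file name itself as XML text.
    my $doc = XML::LibXML->load_xml(location => $fileName);

    return map { $_->parentNode->toString }
           $doc->findnodes("/*/EventInfo[\@RequestId='$requestId']");
}

# Hypothetical command line: script.pl REQUEST_ID FILE
if (@ARGV == 2) {
    my ($requestId, $fileName) = @ARGV;
    print "$_\n" for message_from_file($fileName, $requestId);
}
```

This only works when the file is a single XML document; a file with several `<?xml ...?>` declarations will still be rejected by the parser.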

        Will XML::LibXML handle any type of file, or does it strictly need an XML file?

        It will need an XML file conforming to the specifications. I am having trouble understanding the sample data you posted; please use <code> tags. Is this all one file, or three separate files? If the latter, then the above change should be all you need.

        If however the input you pasted here is from one single file (as you seem to be saying with the "log file"), then this is not a standard XML file, as the <?xml...?> declaration may only appear once, at the top of the file. First, I would recommend you check the source of the data, whether you can retrieve the pieces of XML as individual files. If not, I might complain to whomever is generating this data that it does not conform to XML specifications :-)

        If that doesn't work, you may be left with parsing the file and breaking it into individual chunks that a normal XML parser can handle. In that case, you'll have to show a sample input that is representative of the data you're getting, in <code> tags. But try and see if you can get data conforming to the standards first.
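A minimal sketch of that chunking approach, under stated assumptions: each message begins with its own `<?xml ...?>` declaration, and nothing but whitespace follows a message's root element before the next declaration. The real log format has not been shown, so the sub names `split_xml_chunks` and `messages_for_request` are hypothetical, and chunks that fail to parse are simply skipped.

```perl
use strict;
use warnings;
use XML::LibXML;

# Split raw text into candidate documents, one per "<?xml" declaration.
sub split_xml_chunks {
    my ($text) = @_;
    # The lookahead keeps each declaration attached to its own chunk;
    # grep drops any leading text that precedes the first declaration.
    return grep { /^<\?xml\b/ } split /(?=<\?xml\b)/, $text;
}

# Return the serialized root element of every chunk whose EventInfo
# child carries the wanted RequestId.
sub messages_for_request {
    my ($text, $requestId) = @_;
    my @hits;
    for my $chunk (split_xml_chunks($text)) {
        my $doc = eval { XML::LibXML->load_xml(string => $chunk) };
        next unless $doc;    # assumption: ignore chunks that are not well-formed
        my @match = $doc->findnodes("/*/EventInfo[\@RequestId='$requestId']");
        push @hits, $doc->documentElement->toString if @match;
    }
    return @hits;
}

# Hypothetical command line: script.pl REQUEST_ID FILE
if (@ARGV == 2) {
    my ($requestId, $fileName) = @ARGV;
    open my $fh, '<', $fileName or die "Cannot open $fileName: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print "$_\n\n" for messages_for_request($text, $requestId);
}
```

If other log text follows a message before the next declaration, that chunk will not parse and is silently dropped here, so the splitting step would need to be adapted to the actual log layout.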

        Hope this helps,
        -- Hauke D

Re: Retrieving XML From a File Based On Child Node Attribute
by Laurent_R (Canon) on Feb 11, 2017 at 10:40 UTC
Re: Retrieving XML From a File Based On Child Node Attribute
by poj (Abbot) on Feb 11, 2017 at 12:21 UTC

    Am I right in assuming all those individual messages are contained in one file?

    If they are, try:

    perl -ne 'BEGIN{$/="\n\n"};print if /RequestId="321150454"/' sample.xml
    poj
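The one-liner above can also be written out as a short script. This is only a sketch under the same assumption the one-liner makes, namely that messages are separated by a blank line; it matches on text rather than parsing the XML, and the sub name `records_for_request` is hypothetical.

```perl
use strict;
use warnings;

# Return every blank-line-separated record of $fileName that mentions
# the given RequestId attribute value.
sub records_for_request {
    my ($fileName, $requestId) = @_;
    open my $fh, '<', $fileName or die "Cannot open $fileName: $!";
    local $/ = "\n\n";    # same input record separator as the one-liner
    my @hits = grep { /RequestId="\Q$requestId\E"/ } <$fh>;
    close $fh;
    return @hits;
}

# Hypothetical command line: script.pl REQUEST_ID FILE
if (@ARGV == 2) {
    my ($requestId, $fileName) = @ARGV;
    print for records_for_request($fileName, $requestId);
}
```

Setting `$/` changes what Perl treats as one "line" when reading, so each read returns a whole message instead of a single line.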
      The "<?xml ..." means they must be different files.