Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

To start, perl is not my first language, though I can get by normally... and I am an XML novice. I googled till my brain hurts and haven't found anything that's really cleared up my needs, so here I am. Thanks for any insight that can be provided.

I've been given an assignment to capture specific data out of XML files being received daily from a vendor. The vendor provided the following (I provided a few examples, there are actually several dozen XPaths they provided) as reference to what I need to extract from the XML file.

//Party[@id=//Relation[child::RelationRoleCode[@tc='37']]/@RelatedObje +ctID]/Producer/CarrierAppointment/CompanyProducerID Holding[main]/Policy/KeyedValue/KeyValue substring(//OLifE/Party[@id=//OLifE/Relation[RelationRoleCode/@tc=8]/@ +RelatedObjectID]/Person/FirstName, 1, 30) //OLifE/Party[@id=//OLifE/Relation[RelationRoleCode/@tc=8]/@RelatedObj +ectID]/Person/BirthDate //OLifE/Holding[main]/Policy/KeyedValue[KeyName = 'SponsorName']/KeyVa +lue

I was able to use XML::Parser with Data::Dumper to dump the whole file... not a lot of help, but I knew I could read a file. Then I used this snippet to get all the TAGS, but still not any help getting the data for the specified XPaths.

#!/usr/opt/perl5/bin/perl use XML::Parser; my $parser = new XML::Parser (); $parser->setHandlers ( Start => \&Start_handler, End => \&End_handler, Default => \&Default_handler ); my $filename = shift; die "Can't find '$filename': $!\n" unless -f $filename; $parser->parsefile ($filename); ### HANDLERS ### sub Start_handler { my $p = shift; my $el = shift; print "START: <$el>\n"; while (my $key = shift) { my $val = shift; print " $key = $val\n"; } print "\n"; } ### sub End_handler { my ($p,$el) = @_; print "END: </$el>\n"; } ### sub Default_handler { my ($p,$str) = @_; # print " default handler found '$str'\n"; }

Here is are snippets of a sample XML file

<?xml version="1.0"?> <TXLife xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs +d="http://www.w3.org/2001/XMLSchema" Version="2.27.00" xmlns="http:// +ACORD.org/Standards/Life/2"> <UserAuthRequest> <UserLoginName>userID</UserLoginName> <UserPswd> <CryptType>NONE</CryptType> <Pswd>!password</Pswd> </UserPswd> <VendorApp> <VendorName>Something Technologies</VendorName> <AppName>FireMeNow</AppName> </VendorApp> </UserAuthRequest> <TXLifeRequest id="TXLifeRequest_xxxxxxxx-5e38-411b-9f56-2f2c7474f5c +5" PrimaryObjectID="Holding_aa2b0594-77d6-4264-b277-0218e852cb36"> <TransRefGUID>ccb03d81-7772-4cae-bbf6-436cda593b31</TransRefGUID> <TransType tc="103">New Business Submission for a </TransType> <TransExeDate>2015-09-10</TransExeDate> <TransExeTime>11:45:29.1882187-06:00</TransExeTime> <OLifE> <Holding id="Holding_aa2b0594-77d6-4264-b277-0218e852cb36"> <HoldingTypeCode tc="2">Policy</HoldingTypeCode> <CurrencyTypeCode tc="840">USD (United States Dollar)</Currenc +yTypeCode> <Policy> <ProductType tc="10">Variable Annuity</ProductType> <ApplicationInfo> <TrackingID>243958e8-f013-4a9b-a8de-996c5f5c93ed</Tracking +ID> </ApplicationInfo> <FinancialActivity> <FinActivityType tc="7">Initial payment - This is the </Fi +nActivityType> <Payment> <SourceOfFundsTC tc="32">Retirement</SourceOfFundsTC> </Payment> </FinancialActivity> <KeyedValue> <KeyName>AccountType</KeyName> <KeyValue>Individual</KeyValue> </KeyedValue> <KeyedValue> <KeyName>CheckIndicator</KeyName> <KeyValue>No</KeyValue> </KeyedValue> <KeyedValue> <KeyName>1035ExchangeIncluded</KeyName> <KeyValue>No</KeyValue> </KeyedValue> <KeyedValue> <KeyName>SponsorName</KeyName> <KeyValue>InsCompany</KeyValue> </KeyedValue> </Policy> </Holding> <Holding id="Holding_f2cfb5bd-6009-4bf2-95a0-686cb69a3b7a"> <HoldingTypeCode tc="2">Policy</HoldingTypeCode> <CurrencyTypeCode tc="840">USD (United States Dollar)</Currenc +yTypeCode> <AssetValue>79899.00</AssetValue> <Policy CarrierPartyID="Party_2eed3423-601b-4792-95d5-75f69e75 +7e71"> <Annuity> <SurrenderCharge>0.00</SurrenderCharge> </Annuity> <KeyedValue> <KeyName>OldProductType</KeyName> <KeyValue>401K</KeyValue> </KeyedValue> </Policy> </Holding> <Party id="Party_2eed3423-601b-4792-95d5-75f69e757e71"> <PartyTypeCode tc="2">Organization</PartyTypeCode> <FullName>InsCompany</FullName> <Organization /> </Party> <Party id="Party_2d205fbf-cadd-4475-9d51-a8a8aea1c625"> <PartyTypeCode tc="1">Person</PartyTypeCode> <Person> <FirstName>TheFirstName</FirstName> <LastName>TheLastName</LastName> </Person> <Producer> <CarrierAppointment> <CompanyProducerID>459JQ</CompanyProducerID> </CarrierAppointment> </Producer> </Party> <Party id="Party_869c61bc-96f9-497d-ad5b-a4341aec5311"> <PartyTypeCode tc="1">Person</PartyTypeCode> <GovtID>123-45-6789</GovtID> <Person> <FirstName>User</FirstName> <LastName>Test</LastName> <BirthDate>1966-01-22</BirthDate> </Person> <Address id="Address_8dc316c5-bcc3-45ff-b1c1-a15bc233c9d6"> <AddressTypeCode tc="30">OLI_ADTYPE_PRIMARY</AddressTypeCode +> <AddressStateTC tc="55">OLI_USA_VA</AddressStateTC> <AddressCountryTC tc="1">United States of America</AddressCo +untryTC> </Address> <KeyedValue> <KeyName>ForeignAddressInd</KeyName> <KeyValue>no</KeyValue> </KeyedValue> <KeyedValue> <KeyName>ForeignCitizenInd</KeyName> <KeyValue>yes</KeyValue> </KeyedValue> </Party> <Relation id="Relation_0e3f2e86-3851-401e-b069-25c31eb8d989" Ori +ginatingObjectID="Holding_aa2b0594-77d6-4264-b277-0218e852cb36" Relat +edObjectID="Holding_f2cfb5bd-6009-4bf2-95a0-686cb69a3b7a"> <OriginatingObjectType tc="4">Holding</OriginatingObjectType> <RelatedObjectType tc="4">Holding</RelatedObjectType> <RelationRoleCode tc="64">OLI_REL_REPLACEDBY</RelationRoleCode +> </Relation> <Relation id="Relation_daf1bb84-658a-4bad-aff7-86d5fc755101" Ori +ginatingObjectID="Holding_aa2b0594-77d6-4264-b277-0218e852cb36" Relat +edObjectID="Party_2d205fbf-cadd-4475-9d51-a8a8aea1c625"> <OriginatingObjectType tc="4">Holding</OriginatingObjectType> <RelatedObjectType tc="6">OLI_PARTY</RelatedObjectType> <RelationRoleCode tc="37">OLI_REL_PRIMAGENT</RelationRoleCode> </Relation> <Relation id="Relation_e944395e-0545-48f5-a38d-95b7cc08e73c" Ori +ginatingObjectID="Holding_aa2b0594-77d6-4264-b277-0218e852cb36" Relat +edObjectID="Party_869c61bc-96f9-497d-ad5b-a4341aec5311"> <OriginatingObjectType tc="4">Holding</OriginatingObjectType> <RelatedObjectType tc="6">OLI_PARTY</RelatedObjectType> <RelationRoleCode tc="8">OLI_REL_OWNER</RelationRoleCode> </Relation> <Relation id="Relation_80c5add6-3d38-4310-948c-79b442bc75ec" Ori +ginatingObjectID="Holding_aa2b0594-77d6-4264-b277-0218e852cb36" Relat +edObjectID="FormInstance_f76dede8-0d5e-43ae-8772-1d8aff465982"> <OriginatingObjectType tc="6">OLI_PARTY</OriginatingObjectType +> <RelatedObjectType tc="101">OLI_FORMINSTANCE</RelatedObjectTyp +e> <RelationRoleCode tc="107">OLI_REL_FORMFOR</RelationRoleCode> </Relation> <FormInstance id="FormInstance_f76dede8-0d5e-43ae-8772-1d8aff465 +982"> <FormName>APP</FormName> <Attachment id="Attachment_a1ac9dd9-4f6a-4cc3-bf00-e3e6d483a09 +a"> <DateCreated>2015-09-10</DateCreated> <AttachmentBasicType tc="1">OLI_LU_BASICATTMNTTY_TEXT</Attac +hmentBasicType> <Description>APP</Description> <AttachmentData></AttachmentData> <MimeTypeTC tc="17">OLI_INLINE</MimeTypeTC> <TransferEncodingTypeTC tc="4" /> <AttachmentLocation tc="1" /> </Attachment> </FormInstance> </OLifE> <OLifEExtension VendorCode="25" ExtensionCode="PROVIDER_VERSION"> <PROVIDER_VERSION>2.6.0.361</PROVIDER_VERSION> </OLifEExtension> </TXLifeRequest> </TXLife>

Any help would be greatly appreciated. Thank you

Replies are listed 'Best First'.
Re: Xpath value query
by Corion (Patriarch) on Sep 17, 2015 at 18:05 UTC

    I wouldn't use XML::Parser but one of the modules more oriented towards XPath expressions, such as XML::Twig or XML::LibXML. If your data fits in memory (and will do so for some time coming), using XML::LibXML can be considerably faster than XML::Twig, but if there might be a chance that your data will not fit into memory as a DOM tree, XML::Twig is the absolutely best way to go for a parser.

    Both modules will give you ways to issue XPath queries against the XML.

      OK, so I looked at XML::Twig, and it looks like it should do what I need. My problem is, when I use the XPaths I'm being provided, I get errors.. for example running this against my example XML file.

      #!/usr/opt/perl5/bin/perl -sw use XML::Twig; my $twig = XML::Twig->new(); $twig->parsefile($xmlfile); my $root = $twig->root; foreach my $i ($root->get_xpath('substring(//OLifE/Party[@id=//OLifE/R +elation[RelationRoleCode/@tc=8]/@RelatedObjectID]/Person/First Name, 1, 30)')) { print $i->{att}->{InvType}; foreach my $j ($i->get_xpath('../InvCounts/InvCount')) { print " " . $j->{att}->{Count}; } print "\n"; }

      the result is this

      ./twigs.pl -xmlfile=a2b9f375-51fe-41a1-86ab-069561517890.xml error in xpath expression substring(//OLifE/Party[@id=//OLifE/Relation +[RelationRoleCode/@tc=8]/@RelatedObjectID]/Person/FirstName, 1, 30) a +round substring(//OLifE/Party[@id=//OLifE/Relation[RelationRoleCode/@ +tc=8]/@RelatedObjectID]/Person/FirstName, 1, 30) at ./twigs.pl line 7

      So, I guess I don't understand how I would use the XPaths I've been provided. Do I have to find the attributes they referecne first, and then substitute that into the related entry in the XPath?

        You've been given

        //Party[@id=//Relation[child::RelationRoleCode[@tc='37']]/@RelatedObje +ctID]/Producer/CarrierAppointment/CompanyProducerID

        as an XPath expression. Why do you think you need to wrap substring(...) around that? What is your goal here?

        Maybe start out with the simple XPath expressions, find out what XML::Twig returns and then look at how you get from a node to the values you really want.

        I don't know what you are trying to extract but here's a simple example to get you started

        #!perl use strict; use XML::Twig; my $twig = XML::Twig->new(); $twig->parsefile('txlife.xml'); my $root = $twig->root; my @nodes1 = $root->get_xpath('//Holding'); for my $hold (@nodes1){ print "\nId = ".$hold->att('id')."\n"; my @nodes2 = $hold->get_xpath('Policy/KeyedValue/KeyValue'); for my $_ (@nodes2){ print $_->prev_sibling_text." = "; print $_->text."\n"; } }
        Ouput should be
        Id = Holding_aa2b0594-77d6-4264-b277-0218e852cb36 AccountType = Individual CheckIndicator = No 1035ExchangeIncluded = No SponsorName = InsCompany Id = Holding_f2cfb5bd-6009-4bf2-95a0-686cb69a3b7a OldProductType = 401K
        poj
      For massively large XML files, you can usually use XML::LibXML::Reader.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Xpath value query
by choroba (Cardinal) on Sep 18, 2015 at 12:47 UTC
    Note that your XML declares a prefixless namespace, i.e. all its nodes belong to a namespace. The XPath expressions don't mention any namespace, therefore there are no matches in the document. You can easily verify that in XML::XSH2, a wrapper around XML::LibXML:
    $scratch/> open 1.xml parsing 1.xml done. + + + /> +ls //Party Found 0 node(s). + + + /> +register-namespace l http://ACORD.org/Standards/Life/2 + + + /> +count //l:Party 3 /> ls //Party[@id=//Relation[child::RelationRoleCode[@tc='37']]/@Relat +edObjectID]/Producer/CarrierAppointment/CompanyProducerID Found 0 node(s). /> ls //l:Party[@id=//l:Relation[child::l:RelationRoleCode[@tc='37']]/ +@RelatedObjectID]/l:Producer/l:CarrierAppointment/l:CompanyProducerID <CompanyProducerID>459JQ</CompanyProducerID> Found 1 node(s).
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      OK, so this looks very promising... in that you are using one of the XPaths that include attribute references. However, I don't understand if you wrote a code snippet to use this functionality, or what. Again, I'm very new to XML so forgive my ignorance. Why/what does adding 1: to //Party[@id=//Relation[child::RelationRoleCode@tc='37']/@RelatedObjectID]/Producer/CarrierAppointment/CompanyProducerID do? I understanding something to do with namespace, but I'm not really sure I understand what a namespace is? What I really would like to know, is there a way to use this XPath to directly access it's value with a perl statement? Although I'm beginning to understand a little, I still feel like I'm neck deep in mud! Again, thanks for all the ideas and insight!