Jerome has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I encounter problems in my try to extract datas from an XML file. I'm a bit in a urge so don't have much time to get this done so I would greatly appreciate your help!

My XML structure is like this :

<DSExport> <Header CharacterSet="ENGLISH" ExportingTool="Ascential DataStage E +xport" ToolVersion="4" ServerName="VILLETTE" ToolInstanceID="Portage" + Date="2006-03-29" Time="10.00.34" ServerVersion="7.5.1.A"/> <Job Identifier="adresses" DateModified="2006-03-28" TimeModified=" +17.15.15"> <Record Identifier="ROOT" DateModified="1899-12-30" TimeModified +="00.00.01" Type="JobDefn" Readonly="0"> <Property Name="Name">adresses</Property> <Property Name="Description">DFE/ job of addresses MAJ - 18/1 +1/2004</Property> <Property Name="NextID">2</Property> <Property Name="Container">V0</Property> <Property Name="FullDescription" PreFormatted="1"> RD/ Preprod ok - 15/05/2001 (!!!perf on upd?) Modified by LCO : 19/06/2001: Use system_origin as parameter Modified by LCO : 22/06/2001: Retreive Country Short Name Modified by EPE : 17/02/2004: ajout de la notion Sales Territory (Coun +try, region, sector) Modified by EPE : 19/02/2004: ajout de la notion Salesrep_id Modified by DFE : 18/11/2004: modification des requetes ar_ra_adresses +_n_phone et oa_ra_territories: ajout de comparaison avec la last_upda +te_date de HZ_CUST_SITE_USES_ALL </Property> <Property Name="JobVersion">50.0.0</Property> <Property Name="ControlAfterSubr">1</Property> <Collection Name="Parameters" Type="Parameters"> <SubRecord> <Property Name="Name">$GV_Portage_Rejects_Pathname</Pro +perty> <Property Name="Prompt">GV Portage Rejects Pathname</Pr +operty> <Property Name="Default">F:\Projects\BI_Interfaces\Port +age\Rejects</Property> <Property Name="ParamType">0</Property> <Property Name="ParamLength">0</Property> <Property Name="ParamScale">0</Property> </SubRecord> <SubRecord> <Property Name="Name">$GV_Portage_Hash_Pathname</Proper +ty> <Property Name="Prompt">GV Portage Hash Pathname</Prope +rty> <Property Name="Default">F:\Projects\BI_Interfaces\Port +age\Hash\</Property> <Property Name="ParamType">0</Property> <Property Name="ParamLength">0</Property> <Property Name="ParamScale">0</Property> </SubRecord> <Record Identifier="V0S0P1" DateModified="1899-12-30" TimeModifi +ed="00.00.01" Type="CustomOutput" Readonly="0"> <Property Name="Name">li_hf_adresses</Property> <Property Name="Partner">V0S1</Property> <Collection Name="Properties" Type="CustomProperty"> <SubRecord> <Property Name="Name">TABLE</Property> <Property Name="Value">APPS.RA_ADDRESSES_ALL</Property> </SubRecord> <SubRecord> <Property Name="Name">GENSQL</Property> <Property Name="Value">Yes</Property> </SubRecord> <SubRecord> <Property Name="Name">USERSQL</Property> <Property Name="Value" PreFormatted="1"> SELECT ADDRESS_ID,LAST_UPDATE_DATE,LAST_UPDATED_BY,CREATION_DATE,CREAT +ED_BY,COUNTRY,ADDRESS1,ADDRESS2,ADDRESS3,ADDRESS4,CITY,POSTAL_CODE,ST +ATE,PROVINCE,COUNTY FROM APPS.RA_ADDRESSES_ALL WHERE LAST_UPDATE_DATE +&gt;=(SELECT LAST_LOADING_DATE from DWADMIN.INT_LOADINGS@#DB_LINK# WHERE TABLE_NAME = &apos;CUS_ADDRESSES&apos; AND SYSTEM_ORIGIN = &apos;#system_origin#&apos; ) </Property> </SubRecord> <SubRecord> <Property Name="Name">ARRAYSIZE</Property> <Property Name="Value">32000</Property> </SubRecord> <SubRecord> <Property Name="Name">WHERE_CLAUSE</Property> <Property Name="Value" PreFormatted="1"> LAST_UPDATE_DATE&gt;=(SELECT LAST_LOADING_DATE from DWADMIN.INT_LOADIN +GS@#DB_LINK# WHERE TABLE_NAME = &apos;CUS_ADDRESSES&apos; AND SYSTEM_ORIGIN = &apos +;#system_origin#&apos; ) or (select max(last_update_date) from apps.HZ_CUST_SITE_USES_ALL where cust_acct_site_id = RA_ADDRESSES_ALL.address_id) &gt;=(SELECT LAST_LOADING_DATE from DWADMIN.INT_LOADINGS@#DB_LINK# WHERE TABLE_NAME = &apos;CUS_ADDRESSES&apos; AND SYSTEM_ORIGIN = &apo +s;#system_origin#&apos;) </Property> </SubRecord> ...
Sorry that I quoted a so long example. It's just to show you thers a lot of datas that I dont wabt to retrieve.... But they have the sames markup than the datas I'm interested in.

My objective :
I would like to access the job name, wich is the Job identifier. The I would like to grab names like
<Property Name="Name">li_hf_adresses</Property>

The rule is that if the subrecord contains the property Named TABLE, I would like to retrieve the Values that comes next. Same thing if the property name is USER SQL.

But I can't figure out what's the elements name. I tried with childrens or Twig/Headers but no good.

Could someone give me a clue please?
Thanks in advance!

Here is my perl code:

#!/bin/perl -w use strict; use XML::Twig; my $t= new XML::Twig( TwigHandlers=> { job => \&job}); $t->parsefile( 'T:\\BI\\Jerome\\xml\\tree.xml'); exit; sub job { my ($t, $job)= @_; my %job; $job{Attr_name}= join ':',map {$_->text || ''} @{[$job->children( +'Name')]}; $job{Attr_value}= join ':',map {$_->text || ''} @{[$job->children( + 'Value')]}; print "$job{Attr_name}: $job{Attr_value}\n"; $job->delete; }

20060401 Janitored by Corion: Put XML into readmore tag

Replies are listed 'Best First'.
Re: XML::Twig usage incomprehension
by Skeeve (Parson) on Mar 31, 2006 at 17:47 UTC
    You get the attributes with
    $job->{'att'}->{'attributename'}
    Have you ever tried perldoc XML::Twig?

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

      Well, $job->{'att'}->{'attributename'} actually breaks the encapsulation. The proper way would be $job->att( 'attributename'). Which of course begs the question: Have you ever tried perldoc XML::Twig? ;--) (or looking at the XML::Twig Quick Reference?)

      I can't really blame you though, I tend to do the same thing: when I start working with a module, I use the debugger to figure out its implementation, and often "forget" to read the API.

        I read it a thousand times or more. Not from top to bottom, but using /searchword and then reading. I actually never read anything about $job->att('attributename') and was always asking myself, why I had to use a hash here. I never had the idea to search further. Thanks for pointing me at it!

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        OK, thanks for the clue!

        I can't swear I already used this method but it seems familiar to me....

        Though, I know the perloc -f command but the twig one seems too much complex for me...nobody's perfect ;)

        The twig web site got some examples too, I tried to adapt these examples to my needs but with no success.

        I'll try this code portion tommorrow and will keep you informed about the results.

        Regards.

        Jerome.

Re: XML::Twig usage incomprehension
by mirod (Canon) on Apr 03, 2006 at 08:28 UTC

    I apologize, but instead of an answer, I will offer you a semi-rant on how to ask a question so it's easy for me (and any one else!) to answer. Don't take it baddly, I just hope this will help you get an answer for this question... and the next ones.

    For me to answer, your question should be as close to a test as possible: an example of your (non-working) code, with data (I like it in a DATA section, so I can just cut'n paste the whole thing and have it all in a single file), and the results you expect. If you write it as a real test, using Test or Test::More, that's even better.

    As it is your XML is really long and not well-formed, and I quickly gave up on trying to get the closing tags right. Plus it is a bit difficult for me to figure out exactly what result you want.

    If you post something that I can run, fix and post back, you have a much better chance to get a usefull answer.

      Hello Mirod,
      Let me be more accurate then ;)

      Here's a shortened data sample :

      <?xml version="1.0" encoding="UTF-8"?> <DSExport> <Header CharacterSet="ENGLISH" ExportingTool="Ascential DataStage E +xport" ToolVersion="4" ServerName="VILLETTE" ToolInstanceID="Portage" + Date="2006-03-29" Time="10.00.34" ServerVersion="7.5.1.A"/> <Job Identifier="adresses" DateModified="2006-03-28" TimeModified=" +17.15.15"> <Record Identifier="ROOT" DateModified="1899-12-30" TimeModified +="00.00.01" Type="JobDefn" Readonly="0"> <Property Name="Name">adresses</Property> <Property Name="Description">DFE/ job of addresses MAJ - 18/1 +1/2004</Property> <Property Name="ControlAfterSubr">1</Property> </Record> <Record Identifier="V0S0" DateModified="1899-12-30" TimeModified +="00.00.01" Type="CustomStage" Readonly="0"> <Property Name="AllowColumnMapping">0</Property> <Property Name="NextRecordID">0</Property> </Record> <Record Identifier="V0S0P1" DateModified="1899-12-30" TimeModifi +ed="00.00.01" Type="CustomOutput" Readonly="0"> <Property Name="Name">li_hf_adresses</Property> <Property Name="Partner">V0S1</Property> <Collection Name="Properties" Type="CustomProperty"> <SubRecord> <Property Name="Name">TABLE</Property> <Property Name="Value">APPS.RA_ADDRESSES_ALL</Property> </SubRecord> <SubRecord> <Property Name="Name">USERSQL</Property> <Property Name="Value" PreFormatted="1"> SELECT ADDRESS_ID,LAST_UPDATE_DATE,LAST_UPDATED_BY,CREA +TION_DATE,CREATED_BY,COUNTRY,ADDRESS1,ADDRESS2,ADDRESS3,ADDRESS4,CITY +,POSTAL_CODE,STATE,PROVINCE,COUNTY FROM APPS.RA_ADDRESSES_ALL WHERE L +AST_UPDATE_DATE&gt;=(SELECT LAST_LOADING_DATE from DWADMIN.INT_LOADIN +GS@#DB_LINK# WHERE TABLE_NAME = &apos;CUS_ADDRESSES&apos; AND SYSTEM_ORIGIN = &apos;#system_origin#&apos; ) </Property> </SubRecord> <SubRecord> <Property Name="Name">ARRAYSIZE</Property> <Property Name="Value">32000</Property> </SubRecord> </Collection> <Property Name="TopTextPos">478</Property> </Record> <Record Identifier="V0S19P4" DateModified="1899-12-30" TimeModif +ied="00.00.01" Type="CustomInput" Readonly="0"> <Property Name="Name">li_addresses_ins</Property> <Property Name="Partner">V0S17</Property> <Collection Name="Properties" Type="CustomProperty"> <SubRecord> <Property Name="Name">TABLE</Property> <Property Name="Value">DWADMIN.CUS_ADDRESSES</Property> </SubRecord> <SubRecord> <Property Name="Name">GENSQL</Property> <Property Name="Value">No</Property> </SubRecord> <SubRecord> <Property Name="Name">USERSQL</Property> <Property Name="Value" PreFormatted="1"> INSERT INTO DWADMIN.CUS_ADDRESSES ( ADDRESS_ID, ORIGINAL_ADDRESS_ID, ADDRESS_1,ADDRESS_2, A +DDRESS_3, ADDRESS_4,CITY,ZIP_CODE,COUNTRY,ADD_INFORMATION_1,ADD_INFOR +MATION_2,ADD_INFORMATION_3,EMAIL,CREATED_BY,CREATION_DATE,LAST_UPDATE +D_BY,LAST_UPDATE_DATE,SYSTEM_ORIGIN,SALES_TERRITORY_COUNTRY,SALES_TER +RITORY_ADMIN_REGION,SALES_TERRITORY_SECTOR,SALESREP_ID) VALUES (DWADM +IN.CUS_ADDRESS_ID_SEQ.NEXTVAL,:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12, +:13,TO_DATE(:14, &apos;YYYY-MM-DD HH24:MI:SS&apos;),:15,TO_DATE(:16, +&apos;YYYY-MM-DD HH24:MI:SS&apos;),:17,:18,:19,:20,:21) </Property> </SubRecord> </Collection> </Record> </Job> </DSExport>
      The result I would like to get is :
      <Job Identifier="adresses"> <tables>APPS.RA_ADDRESSES_ALL</tables> <usersql>SELECT ADDRESS_ID,LAST_UPDATE_DATE,LAST_UPDATED_BY,CREA +TION_DATE,CREATED_BY,COUNTRY,ADDRESS1,ADDRESS2,ADDRESS3,ADDRESS4,CITY +,POSTAL_CODE,STATE,PROVINCE,COUNTY FROM APPS.RA_ADDRESSES_ALL WHERE L +AST_UPDATE_DATE&gt;=(SELECT LAST_LOADING_DATE from DWADMIN.INT_LOADIN +GS@#DB_LINK# WHERE TABLE_NAME = &apos;CUS_ADDRESSES&apos; AND SYSTEM_ORIGIN = &apos;#system_origin#&apos; ) </usersql> </Job>
      The table field should be the one line 20 and usersql line 25.

      my not code working is :

      use strict; use XML::Twig; my $t= new XML::Twig(); my $root = $t->root; $t->parsefile( 'T:\\BI\\Jerome\\xml\\tree.xml'); exit; sub job { my ($t, $root)= @_; my %job; $job{Attr_name}= $root->{'att'}->{'Name'}; $job{Attr_value}= $root->{'att'}->{'Value'}; print "$job{Attr_name}: $job{Attr_value}\n"; $root->delete; }
      Note : in that example I just wanted to list all the Name and Value fields but I get no output.....

      Hope my new explanations will be better for you.

      Thanks in advance.

      Jerome

        Sorry, still not enough details. Why do you only include 1 record in the output, when there are 2 properties "TABLE" in the input? Also the code you wrote does not output XML, but a list of property name/values, so your expected result should be at least consistent with that.

        Plus (going back to rant mode) why did you truncate the usersql content? I would appreciate not having to track it down in the input to cut and paste it.

        That said, the input XML seems incredibly clumsy. Faced with something like this, my first instinct would be to turn all those Subrecords that include a single property defined by 2 Property elements into a single element, named after the property name, with the value of the property as content. Once that's done it should be much easier to work with the data.

        The code below does just that.

        #!/usr/bin/perl use strict; use warnings; use XML::Twig; my $t= XML::Twig->new( twig_roots => { 'Collection[@Type="CustomProper +ty"]' => \&collection}, twig_print_outside_roots => 1, ); $t->parsefile( $ARG[0]); exit; sub collection { my ($t, $collection)= @_; foreach my $sub_record ($collection->children( 'SubRecord')) { my $name = $sub_record->field( 'Property[@Name="Name"]'); my $value = $sub_record->field( 'Property[@Name="Value"]'); # we want to keep thæ attributes (except Name of course) my $atts = $sub_record->first_child( 'Property[@Name="Value"] +')->atts; delete $atts->{Name}; my $property= XML::Twig::Elt->new( $name, $atts, $value); $property->replace( $sub_record); } $collection->print; }