I (Think I) Need to Break Up with XML::Simple and I Don't Know Where to Turn

Randomice has asked for the wisdom of the Perl Monks concerning the following question:

Gentlemonks,

After coding a homegrown program to parse an external XML feed, it appears I goofed.

I thought I'd covered all the caveats before using XML::Simple as the parser. It worked wonderfully -- so well that I've coded many hundreds of lines with it. But I didn't study the DTDs carefully enough. Turns out, some nodes can be accurately keyed only if you match the order provided from the XML source.

Now I have to choose a new XML parser and recode the works. For this I need direction.

Here's a mock-up of the XML:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE eggsmell [
<!ELEMENT eggsmell (timestamp, things)>
<!ELEMENT timestamp (#PCDATA)>
<!ELEMENT things (thing*)>
<!ELEMENT thing (unique_identifier, (foos |  bars | bazs)+)>
<!ELEMENT unique_identifier (#PCDATA)>
<!ELEMENT foos (foo*)>
<!ELEMENT foo (foo_datum)>
<!ELEMENT foo_datum (#PCDATA)>
<!ELEMENT bars (bar*)>
<!ELEMENT bar (bar_number, good?, bad?, ugly?, bazs?)>
<!ELEMENT bar_number (#PCDATA)>
<!ELEMENT good (good_info)>
<!ELEMENT bad (bad_info)>
<!ELEMENT ugly (ugly_info)>
<!ELEMENT good_info (#PCDATA)>
<!ELEMENT bad_info (#PCDATA)>
<!ELEMENT ugly_info (#PCDATA)>
<!ELEMENT bazs (baz)>
<!ELEMENT baz (#PCDATA)>
]>

<eggsmell>
   <timestamp>2011-05-09 23:05:33</timestamp>
   <things>
      <thing>
         <unique_identifier>23456</unique_identifier>
         <foos>
            <foo>
               <foo_datum>ABC</foo_datum>
            </foo>
            <foo>
               <foo_datum>XYZ</foo_datum>
            </foo>
         </foos>
         <bars>
            <bar>
               <bar_number>0</bar_number>
               <good>
                  <good_info>-118</good_info>
               </good>
               <bad>
                  <bad_info>-1.5</bad_info>
               </bad>
               <ugly>
                  <ugly_info>7.5</ugly_info>
               </ugly>
            </bar>
            <bar>
               <bar_number>1</bar_number>
               <bad>
                  <bad_info>0</bad_info>
               </bad>
               <ugly>
                  <ugly_info>3.5</ugly_info>
               </ugly>
            </bar>
         </bars>
      </thing>
      <thing>
         <unique_identifier>23458</unique_identifier>
         <foos>
            <foo>
               <foo_datum>SOS</foo_datum>
            </foo>
            <foo>
               <foo_datum>FML</foo_datum>
            </foo>
         </foos>
         <bars>
            <bar>
               <bar_number>0</bar_number>
               <good>
                  <good_info>-116</good_info>
               </good>
               <bad>
                  <bad_info>-1.5</bad_info>
               </bad>
               <ugly>
                  <ugly_info>7.5</ugly_info>
               </ugly>
            </bar>
            <bar>
               <bar_number>1</bar_number>
               <bad>
                     <bad_info>0</bad_info>
               </bad>
               <ugly>
                  <ugly_info>3.5</ugly_info>
               </ugly>
            </bar>
         </bars>
      </thing>
   </things>
</eggsmell>

The program parses values from the XML and compares them to previous instances stored in a database. No output XML needed, it just extracts element data for comparison and back-end storage.

The fly in my corn flakes is that <bar_number> is not a unique value. (I assumed it was.) So it can't be used as a key for <bar>. The current production code hasn't failed on this because (serendipitously) it filters out the more complicated records, the ones where multiple <bar> nodes have the same <bar_number>.

Reviewing parsers that retain the node order, two stick out: XML::LibXML & XML::Twig.

The first seems like a steeper learning curve for someone still struggling to understand XPath. The best LibXML 101 piece I found is grantm's Stepping up from XML::Simple article. But there's a precipitous incline in the curve after that.

The second, XML::Twig, looks more accessible (and sexy! there's a twig and it has a t-shirt!) but even armed with documentation, I can't grasp the basic syntax enough to adapt the code I've already programmed.

With XML::Simple, I'm totally comfortable working with something like:

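#!/usr/bin/perl

use strict;
use warnings;
use XML::Simple;   # provides XMLin

my $xml_file = shift @ARGV;   # path to the XML feed, passed on the command line

my $foo_datum;
my $unique_id;
my $bar_number;
my $good_info;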
my $xml_source = XMLin($xml_file, forcearray => 
   [ qw (thing foo bar bazs) ], keyattr =>[ ]);

foreach my $thing_loop (@{$xml_source->{'things'}->{'thing'}} ) {
   $unique_id = $thing_loop->{'unique_identifier'};   
   
   foreach my $foo_loop (@{$thing_loop->{'foos'}->{'foo'}} ) {
      $foo_datum = $foo_loop->{'foo_datum'};
   }
   
   foreach my $bar_loop (@{$thing_loop->{'bars'}->{'bar'}} ) {
      $bar_number = $bar_loop->{'bar_number'};
      $good_info = $bar_loop->{'good'}->{'good_info'};
   }
   
   # Compare $foo_datum, $bar_number and $good_info with what's in
   # the database.
   
}

Many of the (plentiful) monastery XML::Twig examples use twig_handlers and subroutines. In my real data set, there are dozens of elements in each parent node. In my hands, it's a spaghetti mess trying to translate what should be simple stepping code. After reading examples of Twig 101, I tried the twig_roots method, but am still puzzled as to how to get at the data more than one child node deep (and in multiple iterations).

The vast bulk of my code consists of straightforward loops like:

my @unique_ids = (23456, 23458);
my $xml_source_key = XMLin($xml_file, forcearray => 
   [ qw (thing foo bar bazs) ], keyattr =>[ 'unique_identifier' ]);

foreach my $unique_id (@unique_ids) {
   foreach my $bar_loop ( @{$xml_source_key->{'things'}->{'thing'}->{$unique_id}->{'bars'}->{'bar'} } ) {
      $bar_number = $bar_loop->{'bar_number'};
      $good_info = $bar_loop->{'good'}->{'good_info'};
      
      # Do something with $bar_number, $good_info and a database
   }
}

The ability to convert that above snippet and establish a comfort level working with a new parser would really help in what promises to be a painstaking conversion process.

Any insight, thoughts, bootstrap code, references to how Google is a search engine, direction or even recommendations for another module would be direly appreciated.


Replies are listed 'Best First'.
Re: I (Think I) Need to Break Up with XML::Simple and I Don't Know Where to Turn by tobyink (Canon) on May 29, 2012 at 18:27 UTC

Here's your first example, rewritten to use XML::LibXML...

use 5.010;
use strict;
use XML::LibXML 1.94;

my $xml = XML::LibXML->load_xml(location => 'input.xml');

foreach my $thing ($xml->findnodes('//thing')) {
   # The difference between findnodes and findvalue is this:
   # findnodes returns a list of matching XML::LibXML::Node objects;
   # findvalue returns the first matching string (well, an object
   # that overloads stringification anyway). So here we use the
   # latter...
   #
   my $unique_id = $thing->findvalue('./unique_identifier');
   say "==== ", $unique_id;

   foreach my $foo ($thing->findnodes('./foos/foo')) {
      my $foo_datum = $foo->findvalue('./foo_datum');
      say "foo_datum: ", $foo_datum;
   }

   foreach my $bar ($thing->findnodes('./bars/bar')) {
      my $bar_number = $bar->findvalue('./bar_number');
      my $good_info = $bar->findvalue('./good/good_info');
      say "bar: ", $bar_number, ", good info: ", $good_info;
   }
}

`perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
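
The second snippet from the question, the one that looks things up by unique_identifier, could probably be handled along the same lines with an XPath predicate. What follows is only a sketch and not part of the reply above: the inline sample feed (including the repeated bar_number 0 and its -120 value) and the bars_for helper are made up for illustration, and it assumes the identifiers never contain a double quote.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up, trimmed-down copy of the mock-up feed. The first <thing>
# repeats bar_number 0 on purpose, to show that nothing is keyed on it.
my $xml_text = <<'XML';
<eggsmell>
  <things>
    <thing>
      <unique_identifier>23456</unique_identifier>
      <bars>
        <bar><bar_number>0</bar_number><good><good_info>-118</good_info></good></bar>
        <bar><bar_number>0</bar_number><good><good_info>-120</good_info></good></bar>
        <bar><bar_number>1</bar_number></bar>
      </bars>
    </thing>
    <thing>
      <unique_identifier>23458</unique_identifier>
      <bars>
        <bar><bar_number>0</bar_number><good><good_info>-116</good_info></good></bar>
      </bars>
    </thing>
  </things>
</eggsmell>
XML

my $doc = XML::LibXML->load_xml(string => $xml_text);

# Hypothetical helper: returns [bar_number, good_info] pairs for one
# <thing>, in the order the <bar> elements appear in the source.
sub bars_for {
    my ($doc, $unique_id) = @_;

    # The [unique_identifier="..."] predicate picks out the matching <thing>.
    my $xpath = sprintf
        '/eggsmell/things/thing[unique_identifier="%s"]/bars/bar',
        $unique_id;

    my @rows;
    foreach my $bar ($doc->findnodes($xpath)) {
        # findvalue gives an empty string when <good> is absent.
        push @rows, [
            $bar->findvalue('./bar_number'),
            $bar->findvalue('./good/good_info'),
        ];
    }
    return @rows;
}

foreach my $unique_id (23456, 23458) {
    foreach my $row (bars_for($doc, $unique_id)) {
        my ($bar_number, $good_info) = @$row;
        # Do something with $bar_number, $good_info and a database
        print "$unique_id: bar $bar_number, good_info '$good_info'\n";
    }
}
```

Because findnodes returns the <bar> nodes as a list in document order, two bars sharing a bar_number stay separate entries instead of colliding on a hash key.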
Re: I (Think I) Need to Break Up with XML::Simple and I Don't Know Where to Turn by Anonymous Monk on May 29, 2012 at 16:21 UTC

Does this help? You start with `xml2XMLRules pm973038.xml` and then you're running on steroids (XML::Rules is XML::Simple on steroids)

#!/usr/bin/perl --
#~ 2012-05-29-09:18:14 by Anonymous Monk
#~ perltidy -csc -otr -opr -ce -nibc -i=4
use strict;
use warnings;
use XML::Rules;
use Data::Dump qw/ dd /;

@ARGV or die "Usage: pm973038.pl pm973038.xml\n";
Main( @ARGV );
exit( 0 );

sub Main {
    my( $xmlfile ) = @_;
    my $t = XML::Rules->new(
        qw/ stripspaces 8 /,
        rules => [
            'bad_info,bar_number,foo_datum,good_info,timestamp,ugly_info,unique_identifier' => 'content',
            'bar,foo,thing' => 'as array no content',
            'bad,bars,eggsmell,foos,good,things,ugly' => 'no content'
        ],
    );
    my $res = $t->parsefile( $xmlfile );
    dd $res;
} ## end sub Main
__END__
$ perl pm973038.pl pm973038.xml | perltidy
{
    eggsmell => {
        things => {
            thing => [
                {
                    bars => {
                        bar => [
                            {
                                bad        => { bad_info => -1.5 },
                                bar_number => 0,
                                good       => { good_info => -118 },
                                ugly       => { ugly_info => 7.5 },
                            },
                            {
                                bad        => { bad_info => 0 },
                                bar_number => 1,
                                ugly       => { ugly_info => 3.5 }
                            },
                        ],
                    },
                    foos => { foo => [ { foo_datum => "ABC" }, { foo_datum => "XYZ" } ] },
                    unique_identifier => 23456,
                },
                {
                    bars => {
                        bar => [
                            {
                                bad        => { bad_info => -1.5 },
                                bar_number => 0,
                                good       => { good_info => -116 },
                                ugly       => { ugly_info => 7.5 },
                            },
                            {
                                bad        => { bad_info => 0 },
                                bar_number => 1,
                                ugly       => { ugly_info => 3.5 }
                            },
                        ],
                    },
                    foos => { foo => [ { foo_datum => "SOS" }, { foo_datum => "FML" } ] },
                    unique_identifier => 23458,
                },
            ],
        },
        timestamp => "2011-05-09 23:05:33",
    },
}
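
For the XML::Twig route raised in the question, the same nested loops can apparently be written without twig_handlers or twig_roots by parsing the whole document and stepping through children. This is a rough sketch rather than anyone's posted answer: the inline sample feed (with its repeated bar_number 0 and -120 value) and the @seen array are invented for illustration, and it assumes the feed is small enough to hold in memory.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up, trimmed-down copy of the mock-up feed; bar_number 0 appears
# twice under the first <thing> on purpose.
my $xml_text = <<'XML';
<eggsmell>
  <things>
    <thing>
      <unique_identifier>23456</unique_identifier>
      <bars>
        <bar><bar_number>0</bar_number><good><good_info>-118</good_info></good></bar>
        <bar><bar_number>0</bar_number><good><good_info>-120</good_info></good></bar>
        <bar><bar_number>1</bar_number></bar>
      </bars>
    </thing>
    <thing>
      <unique_identifier>23458</unique_identifier>
      <bars>
        <bar><bar_number>0</bar_number><good><good_info>-116</good_info></good></bar>
      </bars>
    </thing>
  </things>
</eggsmell>
XML

my $twig = XML::Twig->new;
$twig->parse($xml_text);    # for a file on disk, parsefile() takes a path instead

# Collected [unique_id, bar_number, good_info] rows, in document order.
my @seen;

foreach my $thing ($twig->root->first_child('things')->children('thing')) {
    my $unique_id = $thing->first_child_text('unique_identifier');

    # <bars> is optional per the DTD, so skip things that lack it.
    my $bars = $thing->first_child('bars') or next;

    foreach my $bar ($bars->children('bar')) {
        my $bar_number = $bar->first_child_text('bar_number');

        # <good> is optional; descend one more level only when it exists.
        my $good      = $bar->first_child('good');
        my $good_info = $good ? $good->first_child_text('good_info') : '';

        # Do something with $bar_number, $good_info and a database
        push @seen, [ $unique_id, $bar_number, $good_info ];
    }
}

print join(', ', @$_), "\n" for @seen;
```

Here children returns the matching child elements in source order and first_child_text returns an empty string when the child is missing, so the loop structure stays close to the XML::Simple version while keeping duplicate bar_number entries apart.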