halfbaked has asked for the wisdom of the Perl Monks concerning the following question:

I need a really simple way to parse a tiny XML document, but I'm having trouble finding the proper solution.

I'm using XML::LibXML and I can easily pull out the doctype, charset and other data I need, but I can't figure out how to grab a list of the errors located in the errorlist tag.

Here's an example of the XML I need to parse. I can't change its format.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse
        env:encodingStyle="http://www.w3.org/2003/05/soap-encoding"
        xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:uri>http://www.perlmonks.org/</m:uri>
      <m:checkedby>http://localhost/w3c-markup-validator/</m:checkedby>
      <m:doctype>-//W3C//DTD HTML 4.0 Transitional//EN</m:doctype>
      <m:charset>utf-8</m:charset>
      <m:validity>false</m:validity>
      <m:errors>
        <m:errorcount>59</m:errorcount>
        <m:errorlist>
          <m:error>
            <m:line>11</m:line>
            <m:col>66</m:col>
            <m:message>document type does not allow element &quot;LINK&quot; here</m:message>
          </m:error>
          <m:error>
            <m:line>14</m:line>
            <m:col>41</m:col>
            <m:message>document type does not allow element &quot;LINK&quot; here</m:message>
          </m:error>
          <m:error>
            <m:line>21</m:line>
            <m:col>4</m:col>
            <m:message>document type does not allow element &quot;META&quot; here</m:message>
          </m:error>
        </m:errorlist>
      </m:errors>
      <m:warnings>
        <m:warningcount>0</m:warningcount>
        <m:warninglist>
          <m:warning><m:message>No Character Encoding Found! Falling back to UTF-8. </m:message></m:warning>
        </m:warninglist>
      </m:warnings>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>


Here's a snippet of the code I'm using to read the XML.
my $ua       = LWP::UserAgent->new();
my $response = $ua->request($request);
my $parser   = XML::LibXML->new();
my $doc      = $parser->parse_string($response->content);

Kube::Demonize::logmsg($response->content);

#for (my $i = 0; $i < @errorlist; $i++) {
#    Kube::Demonize::logmsg(sprintf("%s\n", $errorlist[$i]->getElementsByTagName('m:line')->textContent));
#}

foreach my $d ($doc->getElementsByTagName('m:doctype')) {
    print $d->textContent;
}

foreach my $d ($doc->getElementsByTagName('m:validity')) {
    print $d->textContent;
}

foreach my $d ($doc->getElementsByTagName('m:charset')) {
    print $d->textContent;
}


Don't get bogged down in the implementation; this is just a prototype to illustrate what I'm trying to do.

I'm just looking for the easiest way to pull out the errors and stuff them into a Perl data structure, such as an array of hashes.

Something like this:
$errors[0] = { line => 120, col => 2, message => 'Tag not allowed' };
$errors[1] = { line => 220, col => 3, message => 'Another error?' };


Thanks.

Replies are listed 'Best First'.
Re: Easiest way to parse a simple XML file?
by Tanktalus (Canon) on Dec 11, 2008 at 00:31 UTC

    My swiss-army-knife for XML is XML::Twig, which I used for this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $t = XML::Twig->new();
    $t->parsefile('x.xml');

    my @errors = map {
        my %d = map {
            (my $tag = $_->tag()) =~ s/^m://;
            $tag => $_->text()
        } $_->children();
        \%d
    } $t->get_xpath('//m:error');

    use Data::Dumper;
    print Dumper \@errors;

    The output is:

    $VAR1 = [
      {
        'col' => '66',
        'message' => 'document type does not allow element "LINK" here',
        'line' => '11'
      },
      {
        'col' => '41',
        'message' => 'document type does not allow element "LINK" here',
        'line' => '14'
      },
      {
        'col' => '4',
        'message' => 'document type does not allow element "META" here',
        'line' => '21'
      }
    ];

    You can do the same for warnings and whatever else you have. With some practice with the xpath, you can ensure that you're getting just what you want.
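
As a rough illustration of "the same for warnings", the sketch below applies the XPath one level deeper and keeps only the message text. The inline `$xml` is a trimmed copy of the sample response, used here purely as an assumption so the snippet runs on its own; in the real script the document would come from the validator.

```perl
use strict;
use warnings;
use XML::Twig;

# Trimmed copy of the sample response (assumption: stands in for real input).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:warnings>
        <m:warningcount>0</m:warningcount>
        <m:warninglist>
          <m:warning><m:message>No Character Encoding Found! Falling back to UTF-8. </m:message></m:warning>
        </m:warninglist>
      </m:warnings>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>
XML

my $t = XML::Twig->new();
$t->parse($xml);

# One hash per warning; trimmed_text drops the trailing space in the message.
my @warnings = map { +{ message => $_->trimmed_text } }
               $t->get_xpath('//m:warning/m:message');

print "$_->{message}\n" for @warnings;
```

The leading `+` inside the `map` block tells Perl that the braces build an anonymous hash rather than a nested block.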

      Thanks, Tanktalus. I went with XML::Twig as you suggested; it seems the easiest way to go. Still, I always scratch my head whenever I have to parse XML. It never seems as simple as it should be.

      Maybe it's just me.

      Thanks again.
Re: Easiest way to parse a simple XML file?
by ig (Vicar) on Dec 11, 2008 at 00:49 UTC

    XML::Simple works too:

    use strict;
    use warnings;
    use Data::Dumper;
    use XML::Simple;

    my $ref = XMLin('test.xml');
    print Dumper($ref->{"env:Body"}->{"m:markupvalidationresponse"}->{"m:errors"}->{"m:errorlist"}->{"m:error"});

    produces

    $VAR1 = [
      {
        'm:line' => '11',
        'm:col' => '66',
        'm:message' => 'document type does not allow element "LINK" here'
      },
      {
        'm:line' => '14',
        'm:col' => '41',
        'm:message' => 'document type does not allow element "LINK" here'
      },
      {
        'm:line' => '21',
        'm:col' => '4',
        'm:message' => 'document type does not allow element "META" here'
      }
    ];
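
One caveat with this approach: by default XML::Simple folds a lone repeated element into a plain hash, so a response containing exactly one `m:error` would not come back as an array. A minimal sketch of guarding against that with `ForceArray` follows; the inline `$xml` is a cut-down, single-error variant of the sample, made up for illustration, and the prefix-free key names are a choice made here to match the structure asked for in the question.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Cut-down sample with a single error (assumption: illustrative input only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:errors>
        <m:errorcount>1</m:errorcount>
        <m:errorlist>
          <m:error>
            <m:line>11</m:line>
            <m:col>66</m:col>
            <m:message>document type does not allow element &quot;LINK&quot; here</m:message>
          </m:error>
        </m:errorlist>
      </m:errors>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>
XML

# ForceArray keeps m:error an array ref even when there is only one;
# KeyAttr => [] switches off XML::Simple's automatic key folding.
my $ref = XMLin($xml, ForceArray => ['m:error'], KeyAttr => []);

my $list = $ref->{'env:Body'}{'m:markupvalidationresponse'}
               {'m:errors'}{'m:errorlist'}{'m:error'};

# Copy into hashes with the namespace prefix stripped from the keys.
my @errors = map {
    +{ line => $_->{'m:line'}, col => $_->{'m:col'}, message => $_->{'m:message'} }
} @$list;

printf "%s:%s %s\n", @{$_}{qw(line col message)} for @errors;
```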
Re: Easiest way to parse a simple XML file?
by Anonymous Monk on Dec 11, 2008 at 10:08 UTC
Re: Easiest way to parse a simple XML file?
by Jenda (Abbot) on Dec 12, 2008 at 17:30 UTC
    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        namespaces  => {
            'http://www.w3.org/2005/10/markup-validator' => '',
            'http://www.w3.org/2003/05/soap-envelope'    => 'env',
        },
        start_rules => {
            warnings => 'skip',
        },
        rules => {
            _default  => 'content',
            error     => 'as array no content',
            errorlist => 'no content',
            errors    => sub { return errors => $_[1]->{errorlist}{error} },
            warning   => sub { return warning => $_[1]->{message} },
            'markupvalidationresponse' => 'no content',
            'env:Body'     => 'as is',
            'env:Envelope' => sub { return $_[1]->{'env:Body'}{markupvalidationresponse} },
        }
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);

    __DATA__
    <?xml version="1.0" encoding="UTF-8"?>
    <env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
    ...

    XML::Rules lets you tweak and trim the structure produced by parsing the XML so that you end up with only the stuff you need in a format that's most convenient to you.
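
The question started from XML::LibXML, and none of the replies stay with it. For comparison, a minimal sketch of the same extraction using XML::LibXML::XPathContext is given below. It binds the prefix `m` to the validator's namespace URI explicitly, so the lookup keeps working even if the response uses a different prefix, which a literal `getElementsByTagName('m:error')` would not. The inline `$xml` is a shortened copy of the sample (two errors only) and is an assumption made so the snippet runs standalone.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Shortened copy of the sample response (assumption: illustrative input only).
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:errors>
        <m:errorcount>2</m:errorcount>
        <m:errorlist>
          <m:error>
            <m:line>11</m:line>
            <m:col>66</m:col>
            <m:message>document type does not allow element &quot;LINK&quot; here</m:message>
          </m:error>
          <m:error>
            <m:line>21</m:line>
            <m:col>4</m:col>
            <m:message>document type does not allow element &quot;META&quot; here</m:message>
          </m:error>
        </m:errorlist>
      </m:errors>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind our own prefix to the namespace URI; it need not match the document's.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('m', 'http://www.w3.org/2005/10/markup-validator');

# Build the array of hashes, evaluating child paths relative to each error node.
my @errors;
for my $node ($xpc->findnodes('//m:error')) {
    push @errors, {
        line    => $xpc->findvalue('m:line',    $node),
        col     => $xpc->findvalue('m:col',     $node),
        message => $xpc->findvalue('m:message', $node),
    };
}

printf "%s:%s %s\n", @{$_}{qw(line col message)} for @errors;
```

In the original script, `$response->content` would take the place of `$xml`.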