AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am trying to parse an XML file and am encountering errors. I think it has to do with using the DTD. Here is part of the XML file:

<?xml version="1.0"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 29 October 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
<eSummaryResult>
<DocSum>
    <Id>7597478</Id>
    <Item Name="Caption" Type="String">NC_002192</Item>
    <Item Name="Title" Type="String">Lactococcus lactis plasmid pWV01, complete sequence</Item>
    <Item Name="Extra" Type="String">gi|7597478|ref|NC_002192.1||gnl|NCBI_GENOMES|15284[7597478]</Item>
    <Item Name="Gi" Type="Integer">7597478</Item>
    <Item Name="CreateDate" Type="String">1991/04/15</Item>
    <Item Name="UpdateDate" Type="String">2008/04/09</Item>
    <Item Name="Flags" Type="Integer">520</Item>
    <Item Name="TaxId" Type="Integer">1358</Item>
    <Item Name="Length" Type="Integer">2178</Item>
    <Item Name="Status" Type="String">live</Item>
    <Item Name="ReplacedBy" Type="String"></Item>
    <Item Name="Comment" Type="String"><![CDATA[ ]]></Item>
</DocSum>
<DocSum>
    <Id>7597489</Id>
    <Item Name="Caption" Type="String">NC_002193</Item>
    <Item Name="Title" Type="String">Lactococcus lactis cremoris Cremoris Wg2 plasmid pWVO2, complete sequence</Item>
    <Item Name="Extra" Type="String">gi|7597489|ref|NC_002193.1||gnl|NCBI_GENOMES|15285[7597489]</Item>
    <Item Name="Gi" Type="Integer">7597489</Item>
    <Item Name="CreateDate" Type="String">1993/05/10</Item>
    <Item Name="UpdateDate" Type="String">2008/07/17</Item>
    <Item Name="Flags" Type="Integer">776</Item>
    <Item Name="TaxId" Type="Integer">1359</Item>
    <Item Name="Length" Type="Integer">3826</Item>
    <Item Name="Status" Type="String">live</Item>
    <Item Name="ReplacedBy" Type="String"></Item>
    <Item Name="Comment" Type="String"><![CDATA[ ]]></Item>
</DocSum>

This is my code

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $public_id = "-//NLM//DTD eSummaryResult, 29 October 2004//EN";
my $system_id = "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd";
my $dtd = XML::LibXML::Dtd->new($public_id, $system_id);

my $filename = '/g/Washu_PopGen/test_gi_docsumms_delVer4.xml';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($filename);
my $outfile = '/g/Washu_PopGen/test_gi_taxid_table.txt';

$doc->validate($dtd);

open(OUTFILE, ">", $outfile);
print OUTFILE join("\t", qw(Id TaxId Length Status ReplacedBy))."\n";

foreach my $DocSum ($doc->findnodes('/eSummaryResult/DocSum')) {
    my($Id) = $DocSum->findnodes('./Id');
    print OUTFILE $Id->to_literal, "\t";
    my($TaxId) = $DocSum->findnodes('./TaxId');
    print OUTFILE $TaxId->to_literal, "\t";
    my($Length) = $DocSum->findnodes('./Length');
    print OUTFILE $Length->to_literal, "\t";
    my($Status) = $DocSum->findnodes('./Status');
    print OUTFILE $Status->to_literal, "\t";
    my($ReplacedBy) = $DocSum->findnodes('./ReplacedBy');
    print OUTFILE $ReplacedBy->to_literal, "\n";
}

This is part of my error message:

No declaration for element eSummaryResult
No declaration for element DocSum
No declaration for element Id
No declaration for element Item
No declaration for attribute Name of element Item
No declaration for attribute Type of element Item
No declaration for element Item
No declaration for attribute Name of element Item
No declaration for attribute Type of element Item
No declaration for element Item
No declaration for attribute Name of element Item
No declaration for attribute Type of element Item

This is the dtd file

<!--
This is the Current DTD for Entrez eSummary version 2
$Id: eSummary_041029.dtd 49514 2004-10-29 15:52:04Z parantha $
-->
<!-- ================================================================= -->

<!ELEMENT Id (#PCDATA)> <!-- \d+ -->
<!ELEMENT Item (#PCDATA|Item)*> <!-- .+ -->
<!ATTLIST Item
    Name CDATA #REQUIRED
    Type (Integer|Date|String|Structure|List|Flags|Qualifier|Enumerator|Unknown) #REQUIRED
>
<!ELEMENT ERROR (#PCDATA)> <!-- .+ -->
<!ELEMENT DocSum (Id, Item+)>
<!ELEMENT eSummaryResult (DocSum|ERROR)+>

Thanks! Any help is appreciated.

Replies are listed 'Best First'.
Re: LibXML and parsing file with DTD
by ikegami (Patriarch) on Jul 29, 2010 at 16:49 UTC

    After adding the missing </eSummaryResult> (which prevented parsing even before validation), it validated without problem for me.

    I don't understand why it would fail for you. LibXML is very noisy on error, including download errors. Maybe your version has a bug in it? Or maybe it's picking up the DTD from the system catalog instead of downloading it? (The catalog is basically a schema cache so that you don't need to download the same schemas repeatedly. XML::LibXML doesn't add to the catalog, but it can read from it.)
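One way to take both the download and the catalog out of the picture is to build the DTD object from a local copy of the DTD text. The following is only a minimal sketch under assumptions, not the original poster's setup: the DTD text is the one quoted in the question, the XML is a cut-down sample without a DOCTYPE, and the variable names are made up for illustration. It assumes XML::LibXML is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Local copy of the DTD quoted in the question (no network, no catalog lookup).
my $dtd_text = <<'DTD';
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Item (#PCDATA|Item)*>
<!ATTLIST Item
    Name CDATA #REQUIRED
    Type (Integer|Date|String|Structure|List|Flags|Qualifier|Enumerator|Unknown) #REQUIRED
>
<!ELEMENT ERROR (#PCDATA)>
<!ELEMENT DocSum (Id, Item+)>
<!ELEMENT eSummaryResult (DocSum|ERROR)+>
DTD

# Cut-down sample document (hypothetical; no DOCTYPE, so nothing is fetched).
my $xml_text = <<'XML';
<?xml version="1.0"?>
<eSummaryResult>
<DocSum>
    <Id>7597478</Id>
    <Item Name="TaxId" Type="Integer">1358</Item>
</DocSum>
</eSummaryResult>
XML

my $dtd = XML::LibXML::Dtd->parse_string($dtd_text);
my $doc = XML::LibXML->load_xml(string => $xml_text);

# is_valid() returns a boolean instead of dying like validate() does.
my $ok = $doc->is_valid($dtd);
print $ok ? "valid\n" : "not valid\n";
```

If the local copy validates while the DTD built with XML::LibXML::Dtd->new($public_id, $system_id) does not, that would point at how the external DTD is being fetched or resolved rather than at the document itself.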

Re: LibXML and parsing file with DTD
by derby (Abbot) on Jul 29, 2010 at 12:05 UTC

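    Your XML is valid ... the real issue is that your XPath expressions for TaxId, Length, Status and ReplacedBy are wrong. There are no elements with those names; they are Item elements identified by their Name attribute. For TaxId, for example, the expression should be: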

    ./Item[@Name="TaxId"]

    -derby
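As an illustration of the attribute-based lookup, here is a minimal sketch that applies the same predicate to every column. The sample XML is cut down from the question, and the loop over names and the @rows array are assumptions made for the example, not part of the original script. It uses findvalue, which returns the string value directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample taken from the question (no DOCTYPE, nothing is fetched).
my $xml_text = <<'XML';
<eSummaryResult>
<DocSum>
    <Id>7597478</Id>
    <Item Name="TaxId" Type="Integer">1358</Item>
    <Item Name="Length" Type="Integer">2178</Item>
    <Item Name="Status" Type="String">live</Item>
    <Item Name="ReplacedBy" Type="String"></Item>
</DocSum>
</eSummaryResult>
XML

my $doc = XML::LibXML->load_xml(string => $xml_text);

my @rows;
foreach my $DocSum ($doc->findnodes('/eSummaryResult/DocSum')) {
    # Id is a real child element; the rest are Item elements keyed by @Name.
    my @fields = ($DocSum->findvalue('./Id'));
    foreach my $name (qw(TaxId Length Status ReplacedBy)) {
        push @fields, $DocSum->findvalue(qq{./Item[\@Name="$name"]});
    }
    push @rows, join("\t", @fields);
}
print "$_\n" for @rows;

# There is no element named TaxId, so this path matches nothing.
my @missing = $doc->findnodes('/eSummaryResult/DocSum/TaxId');
print "TaxId elements found: ", scalar(@missing), "\n";
```

With the original './TaxId' path, findnodes returns an empty list, so $TaxId would be undefined and calling to_literal on it would die once the script got that far.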

      Thanks, I changed my code as you suggested and still got the same error.

      Snippet of the new code:

      my($TaxId) = $DocSum->findnodes('./Item[@Name="TaxId"]');
      print OUTFILE $TaxId->to_literal, "\t";

      From the error it looks like it doesn't like any of the XML elements, all the way from the top nodes (i.e. eSummaryResult, DocSum).

      The error again:

      No declaration for element eSummaryResult
      No declaration for element DocSum
      No declaration for element Id
      No declaration for element Item
      No declaration for attribute Name of element Item
      No declaration for attribute Type of element Item

        Hmmm ... what happens if you comment out the validate call?

        -derby
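A softer alternative to commenting the call out is to trap the failure, since validate dies on an invalid document and nothing after it runs. The sketch below is illustrative only: it uses a deliberately incomplete, made-up DTD so that validation fails, and shows that extraction still works afterwards. The sample XML and variable names are assumptions, not the original poster's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately incomplete DTD (hypothetical) so that validation fails.
my $dtd = XML::LibXML::Dtd->parse_string('<!ELEMENT Id (#PCDATA)>');

my $doc = XML::LibXML->load_xml(string => <<'XML');
<eSummaryResult>
<DocSum>
    <Id>7597478</Id>
    <Item Name="TaxId" Type="Integer">1358</Item>
</DocSum>
</eSummaryResult>
XML

# validate() dies on an invalid document; trap that and carry on.
my $valid = eval { $doc->validate($dtd); 1 } ? 1 : 0;
warn "Validation failed: $@" unless $valid;

# Extraction does not depend on the validation result.
my ($DocSum) = $doc->findnodes('/eSummaryResult/DocSum');
my $taxid = $DocSum->findvalue('./Item[@Name="TaxId"]');
print "valid=$valid TaxId=$taxid\n";
```

This keeps the validation message visible while separating it from the XPath question, which makes it easier to see which of the two problems is still outstanding.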