Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an XML file with the DTD shown below. Which module would be best for parsing the XML document? I tried XML::Twig, but got strange results.
<!--Comment -->
<!ELEMENT Id (#PCDATA)> <!-- \d+ -->
<!ELEMENT Item (#PCDATA|Item)*> <!-- .+ -->
<!ATTLIST Item
    Name CDATA #REQUIRED
    Type (Integer|Date|String|Structure|List|Flags|Qualifier|Enumerator|Unknown) #REQUIRED
>
<!ELEMENT ERROR (#PCDATA)> <!-- .+ -->
<!ELEMENT DocSum (Id, Item+)>
<!ELEMENT eSummaryResult (DocSum|ERROR)+>
Thanks.

Re: DTD and xml module
by ww (Archbishop) on Jun 14, 2007 at 22:01 UTC
    The answers to two questions will help you get help here:
    • What have you tried? (i.e. show us some code.)
    • What was the output that you describe, somewhat imprecisely, as "strange?"
      Here is the code; it prints the whole document:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use XML::Twig;

      my $file = 'Summary';
      my $ty = XML::Twig->new(
          twig_handlers => {
              docsum => \&docsum,
              para   => sub { $_->set_gi('Item') },
          }
      );
      $ty->parsefile($file);
      $ty->flush;

      sub docsum {
          my ($ty, $docsum) = @_;
          $docsum->set_gi('docsum');
          my $title = $docsum->first_child('Item');
          my $and   = $title->{'att'}->{'Name'};
          $docsum->flush;
      }
      Here is the XML file
      <?xml version="1.0"?>
      <!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 29 October 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
      <eSummaryResult>
        <DocSum>
          <Id>25</Id>
          <Item Name="Name" Type="String">ABL1</Item>
          <Item Name="Description" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 1</Item>
          <Item Name="Orgname" Type="String">Homo sapiens</Item>
          <Item Name="Status" Type="Integer">0</Item>
          <Item Name="CurrentID" Type="Integer">0</Item>
          <Item Name="Chromosome" Type="String">9</Item>
          <Item Name="GeneticSource" Type="String">genomic</Item>
          <Item Name="MapLocation" Type="String">9q34.1</Item>
          <Item Name="OtherAliases" Type="String">ABL, JTK7, bcr/abl, c-ABL, p150, v-abl</Item>
          <Item Name="OtherDesignations" Type="String">Abelson murine leukemia viral (v-abl) oncogene homolog 1|BCR/ABL (major breakpoint) fusion peptide|bcr/c-abl oncogene protein|proto-oncogene tyrosine-protein kinase ABL1</Item>
          <Item Name="NomenclatureSymbol" Type="String">ABL1</Item>
          <Item Name="NomenclatureName" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 1</Item>
          <Item Name="NomenclatureStatus" Type="String">Official</Item>
          <Item Name="TaxID" Type="Integer">9606</Item>
          <Item Name="Mim" Type="List">
            <Item Name="int" Type="Integer">189980</Item>
          </Item>
          <Item Name="GenomicInfo" Type="List">
            <Item Name="GenomicInfoType" Type="Structure">
              <Item Name="ChrLoc" Type="String">9</Item>
              <Item Name="ChrAccVer" Type="String">NC_000009.10</Item>
              <Item Name="ChrStart" Type="Integer">132579088</Item>
              <Item Name="ChrStop" Type="Integer">132752882</Item>
            </Item>
          </Item>
        </DocSum>
        <DocSum>
          <Id>27</Id>
          <Item Name="Name" Type="String">ABL2</Item>
          <Item Name="Description" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 2 (arg, Abelson-related gene)</Item>
          <Item Name="Orgname" Type="String">Homo sapiens</Item>
          <Item Name="Status" Type="Integer">0</Item>
          <Item Name="CurrentID" Type="Integer">0</Item>
          <Item Name="Chromosome" Type="String">1</Item>
          <Item Name="GeneticSource" Type="String">genomic</Item>
          <Item Name="MapLocation" Type="String">1q24-q25</Item>
          <Item Name="OtherAliases" Type="String">RP11-177A2.3, ABLL, ARG</Item>
          <Item Name="OtherDesignations" Type="String">Abelson murine leukemia viral (v-abl) oncogene homolog 2|Abelson-related|v-abl Abelson murine leukemia viral oncogene homolog 2</Item>
          <Item Name="NomenclatureSymbol" Type="String">ABL2</Item>
          <Item Name="NomenclatureName" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 2 (arg, Abelson-related gene)</Item>
          <Item Name="NomenclatureStatus" Type="String">Official</Item>
          <Item Name="TaxID" Type="Integer">9606</Item>
          <Item Name="Mim" Type="List">
            <Item Name="int" Type="Integer">164690</Item>
          </Item>
          <Item Name="GenomicInfo" Type="List">
            <Item Name="GenomicInfoType" Type="Structure">
              <Item Name="ChrLoc" Type="String">1</Item>
              <Item Name="ChrAccVer" Type="String">NC_000001.9</Item>
              <Item Name="ChrStart" Type="Integer">177465358</Item>
              <Item Name="ChrStop" Type="Integer">177343379</Item>
            </Item>
          </Item>
        </DocSum>
        <DocSum>
          <Id>90</Id>
          <Item Name="Name" Type="String">ACVR1</Item>
          <Item Name="Description" Type="String">activin A receptor, type I</Item>
          <Item Name="Orgname" Type="String">Homo sapiens</Item>
          <Item Name="Status" Type="Integer">0</Item>
          <Item Name="CurrentID" Type="Integer">0</Item>
          <Item Name="Chromosome" Type="String">2</Item>
          <Item Name="GeneticSource" Type="String">genomic</Item>
          <Item Name="MapLocation" Type="String">2q23-q24</Item>
          <Item Name="OtherAliases" Type="String">ACTRI, ACVRLK2, ALK2, FOP, SKR1</Item>
          <Item Name="OtherDesignations" Type="String">activin A receptor, type II-like kinase 2|activin A type I receptor|hydroxyalkyl-protein kinase</Item>
          <Item Name="NomenclatureSymbol" Type="String">ACVR1</Item>
          <Item Name="NomenclatureName" Type="String">activin A receptor, type I</Item>
          <Item Name="NomenclatureStatus" Type="String">Official</Item>
          <Item Name="TaxID" Type="Integer">9606</Item>
          <Item Name="Mim" Type="List">
            <Item Name="int" Type="Integer">102576</Item>
          </Item>
          <Item Name="GenomicInfo" Type="List">
            <Item Name="GenomicInfoType" Type="Structure">
              <Item Name="ChrLoc" Type="String">2</Item>
              <Item Name="ChrAccVer" Type="String">NC_000002.10</Item>
              <Item Name="ChrStart" Type="Integer">158403035</Item>
              <Item Name="ChrStop" Type="Integer">158301206</Item>
            </Item>
          </Item>
        </DocSum>
      </eSummaryResult>
      I want it to print to a text file with columns separated by tabs, but it seems flush prints out the whole document. One more thing I want to ask: if we specify the attribute, then I don't have to specify its type, right? Thanks.

        I still don't quite understand what it is you want to get. An example of the expected output would definitely help.

        Some comments though, maybe they will put you on the right track:

        • if you don't want to print the document, then don't use flush, because that's what it does. Now if what you want is to free memory, then purge is probably what you are looking for.
        • you have 2 handlers: one on docsum and one on para. Sadly there is no docsum element (there is a DocSum, but XML is case-sensitive), nor any para element in the document, so these handlers are never called.
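
To illustrate the two points above, here is a minimal sketch, not the original poster's script. The inline sample document and the %calls counter are made up for the demonstration; only twig_handlers, parse and purge come from XML::Twig. It registers one handler under the wrong-case name and one under the real element name, and uses purge instead of flush so nothing is printed by the twig itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in for the eSummary document (hypothetical sample data).
my $xml = <<'XML';
<eSummaryResult>
  <DocSum><Id>25</Id><Item Name="Name" Type="String">ABL1</Item></DocSum>
  <DocSum><Id>27</Id><Item Name="Name" Type="String">ABL2</Item></DocSum>
</eSummaryResult>
XML

# Counts how often each handler is called (illustration only).
my %calls = (docsum => 0, DocSum => 0);

my $twig = XML::Twig->new(
    twig_handlers => {
        # Never fires: no element in the document is spelled "docsum".
        docsum => sub { $calls{docsum}++ },
        # Fires once per <DocSum>, after the element is fully parsed.
        DocSum => sub {
            my ($t, $docsum) = @_;
            $calls{DocSum}++;
            $t->purge;    # free what was parsed so far, without printing it
        },
    },
);
$twig->parse($xml);

print "docsum handler calls: $calls{docsum}\n";
print "DocSum handler calls: $calls{DocSum}\n";
```

With the two-record sample, the lowercase handler should report 0 calls and the DocSum handler 2.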
Re: DTD and xml module
by Jenda (Abbot) on Jun 15, 2007 at 12:31 UTC

    "parse the xml document" is not very informative. Until we know what you want to do with the XML, there's no telling which module will best fit the task. It's like asking what tool is best to work with wood. If we do not know what you plan to do with the wood, we can't suggest anything.

      Hi Jenda, the XML document is shown in the previous replies. What I am trying to do is, for each item between <DocSum> tags, write a text file with each entry separated by a tab, for example: id\tItem Name\tItem Description ... and so on.
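
One possible way to get that kind of output with XML::Twig is sketched below. It is a guess at the desired format, not a confirmed answer from the thread: the @columns list, the shortened inline sample and the @rows array are assumptions for illustration. Nested List items such as Mim are collapsed to their text by trimmed_text and are simply not selected here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Which Item names become columns after the Id (hypothetical choice).
my @columns = qw(Name Description);

# Shortened sample in the shape of the posted document.
my $xml = <<'XML';
<eSummaryResult>
  <DocSum>
    <Id>25</Id>
    <Item Name="Name" Type="String">ABL1</Item>
    <Item Name="Description" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 1</Item>
    <Item Name="Mim" Type="List">
      <Item Name="int" Type="Integer">189980</Item>
    </Item>
  </DocSum>
  <DocSum>
    <Id>27</Id>
    <Item Name="Name" Type="String">ABL2</Item>
    <Item Name="Description" Type="String">v-abl Abelson murine leukemia viral oncogene homolog 2 (arg, Abelson-related gene)</Item>
  </DocSum>
</eSummaryResult>
XML

my @rows;
my $twig = XML::Twig->new(
    twig_handlers => {
        DocSum => sub {
            my ($t, $docsum) = @_;
            # Index the direct <Item> children by their Name attribute.
            my %value;
            $value{ $_->att('Name') } = $_->trimmed_text
                for $docsum->children('Item');
            # One tab-separated line per DocSum: Id first, then the columns.
            push @rows, join "\t", $docsum->first_child_text('Id'),
                                   map { $value{$_} // '' } @columns;
            $t->purge;    # release memory for records already handled
        },
    },
);
$twig->parse($xml);

print "$_\n" for @rows;
```

For a real file, parsefile('Summary') would replace parse($xml), and the output can be redirected to a text file from the shell.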