Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I am new to Perl and I have been trying for the past couple of days to use the XML::Simple module to parse some info from an XML document. The wanted tags from the whole document, which is called data.xml, are the following:
<?xml version='1.0'?>
<PubmedArticle>
  <PMID>1766380</PMID>
  <Article PubModel="Print">
    <Journal>
      <JournalIssue CitedMedium="Print">
        <Volume>5</Volume>
        <Issue>9</Issue>
        <PubDate>
          <Year>1991</Year>
          <Month>Sep</Month>
        </PubDate>
      </JournalIssue>
      <ISOAbbreviation>Mol. Microbiol.</ISOAbbreviation>
    </Journal>
    <ArticleTitle>PhoP/PhoQ: macrophage-specific modulators of Salmonella virulence?</ArticleTitle>
    <Pagination>
      <MedlinePgn>2073-8</MedlinePgn>
    </Pagination>
    <AuthorList CompleteYN="Y">
      <Author ValidYN="Y">
        <LastName>Miller</LastName>
        <ForeName>S I</ForeName>
        <Initials>SI</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Tsirigos</LastName>
        <ForeName>K T</ForeName>
        <Initials>KT</Initials>
      </Author>
      <Author ValidYN="Y">
        <LastName>Dinous</LastName>
        <ForeName>A E</ForeName>
        <Initials>AE</Initials>
      </Author>
    </AuthorList>
  </Article>
</PubmedArticle>
I read numerous tutorials and I managed to print all other tags apart from the details for the authors, that is <LastName>, <ForeName>, <Initials>. I used the following code:
#!/usr/bin/perl
use XML::Simple;
use Data::Dumper;

$simple = XML::Simple->new;
$data = $simple->XMLin('data.xml');

print $data->{PMID};
print $data->{Article}->{...}->{...};
I only have a problem with the authors, because there can be more than one, unlike all the other info contained in my XML... Can you help me please?

Re: Perl & Simple::XML
by throop (Chaplain) on Oct 06, 2007 at 03:23 UTC
    I expect your problem is a quirk of XML::Simple. If there is just one author in your AuthorList, XML::Simple turns it into a hash, something like this:
    {AuthorList => {CompleteYN=>"Y", Author => {ValidYN=>"Y", LastName=>"Dinous", ForeName=>"A E"}}}
    But if there's more than one author, XML::Simple puts in an array ref:
    {AuthorList => {CompleteYN=>"Y", Author => [{ValidYN=>"Y", LastName=>"Tsirigos", ForeName=>"K T"}, {ValidYN=>"Y", LastName=>"Dinous", ForeName=>"A E"}]}}
    Use the 'ForceArray' option on XMLin so that Author always gets the array ref. That way your code doesn't choke on one case or the other.

    The notes to XML::Simple say that they wish they'd made ForceArray the default. I guess there's too much established code out there to change the default now.

    Read up on it. BTW, while you're there, read about KeyAttr, too. You'll want to use it.

    throop

      Making it a default for all tags, including those that are not allowed to be repeated in the XML, would cause the resulting data structure to be overly complex and hard to navigate. You would not want to have to write $data->{foo}[0]{bar}[0]{baz}[0] instead of $data->{foo}{bar}{baz}, especially if you knew for sure none of the tags in question can ever be repeated.

      Do use ForceArray, but only use it for the tags that can be repeated. And there, use it for all such tags.
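
A minimal sketch of that advice, assuming XML::Simple is installed. The inline one-author record is a trimmed, illustrative copy of the structure in the question, and the variable names are made up for the example. ForceArray lists only the repeatable Author tag, and KeyAttr => [] switches off array folding:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Illustrative one-author record (trimmed; not the poster's full data.xml).
my $xml = <<'XML';
<PubmedArticle>
  <PMID>16039843</PMID>
  <Article PubModel="Print">
    <AuthorList CompleteYN="Y">
      <Author ValidYN="Y">
        <LastName>Wiener</LastName>
        <Initials>MC</Initials>
      </Author>
    </AuthorList>
  </Article>
</PubmedArticle>
XML

# Force only the repeatable tag into an array ref; leave the rest as plain hashes.
my $data = XMLin($xml, ForceArray => ['Author'], KeyAttr => []);

# Author is an array ref even though the record holds a single author.
my @names = map { "$_->{LastName} $_->{Initials}" }
            @{ $data->{Article}{AuthorList}{Author} };

print join(', ', @names), "\n";
```

With this option set, the same loop should handle records with one author and records with several, while tags such as PMID stay directly reachable as $data->{PMID}.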

Re: Perl & Simple::XML
by graff (Chancellor) on Oct 06, 2007 at 00:22 UTC
    Since you already have use Data::Dumper in your script, you could just add this line, and the contents of $data should become clear:
    print Dumper( $data );
    The output from that call would show you that you reach Author entries like this:
    my @authors = @{$data->{Article}{AuthorList}{Author}};
    print Dumper( \@authors );
    (Updated -- ( -> { -- as per AM's replies. Sorry about that!)
      You can try to use namespaces in your XML; it will help you avoid clashes during XML parsing. XML-Parser supports namespaces, so there shouldn't be a problem. Namespaces are like the lexical-variable solution in Perl (my).
      Hi, I believe what you suggested probably works, but it says that I have a syntax error in the line: my @authors = @($data->{Article}{AuthorList}{Author}}; I tried several things but nothing seems to work... Please advise...
        @{} not @(}
Re: Perl & Simple::XML--->SOS
by Anonymous Monk on Oct 06, 2007 at 08:50 UTC
    Hello again! Everything works normally now. My code is:
    #!/usr/bin/perl

    # use module
    use XML::Simple;
    use Data::Dumper;

    # create object
    $xml = new XML::Simple (KeyAttr=>[]);

    #read XML file
    $data = $xml->XMLin("data.xml");

    #dereference hash ref

    #print Dumper($data);

    print $data->{PMID}, "\n";
    print $data->{Article}->{ArticleTitle}, "\n";
    foreach $e (@{$data->{Article}->{AuthorList}->{Author}})
    {
        $authors.= $e->{LastName}." ".$e->{Initials}.', ';
    }
    print $data->{Article}->{Journal}->{ISOAbbreviation}, " ";
    print $data->{Article}->{Journal}->{JournalIssue}->{PubDate}->{Year}, ";";
    print $data->{Article}->{Journal}->{JournalIssue}->{Volume}, ":";
    print $data->{Article}->{Pagination}->{MedlinePgn}, ".";
    print "\n";
    and it gets all the details I need. But there is a problem when I try to put two XML records together, like:
    <?xml version='1.0'?>
    <PubmedArticle>
      <PMID>1766380</PMID>
      <Article PubModel="Print">
        <Journal>
          <JournalIssue CitedMedium="Print">
            <Volume>5</Volume>
            <Issue>9</Issue>
            <PubDate>
              <Year>1991</Year>
              <Month>Sep</Month>
            </PubDate>
          </JournalIssue>
          <ISOAbbreviation>Mol. Microbiol.</ISOAbbreviation>
        </Journal>
        <ArticleTitle>PhoP/PhoQ: macrophage-specific modulators of Salmonella virulence?</ArticleTitle>
        <Pagination>
          <MedlinePgn>2073-8</MedlinePgn>
        </Pagination>
        <AuthorList CompleteYN="Y">
          <Author ValidYN="Y">
            <LastName>Miller</LastName>
            <ForeName>S I</ForeName>
            <Initials>SI</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Tsirigos</LastName>
            <ForeName>K T</ForeName>
            <Initials>KT</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Dinous</LastName>
            <ForeName>A E</ForeName>
            <Initials>AE</Initials>
          </Author>
        </AuthorList>
      </Article>
    </PubmedArticle>

    <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID>16039843</PMID>
        <DateCreated>
          <Year>2005</Year>
          <Month>08</Month>
          <Day>01</Day>
        </DateCreated>
        <DateCompleted>
          <Year>2005</Year>
          <Month>12</Month>
          <Day>08</Day>
        </DateCompleted>
        <DateRevised>
          <Year>2006</Year>
          <Month>11</Month>
          <Day>15</Day>
        </DateRevised>
        <Article PubModel="Print">
          <Journal>
            <ISSN IssnType="Print">0959-440X</ISSN>
            <JournalIssue CitedMedium="Print">
              <Volume>15</Volume>
              <Issue>4</Issue>
              <PubDate>
                <Year>2005</Year>
                <Month>Aug</Month>
              </PubDate>
            </JournalIssue>
            <Title>Current opinion in structural biology</Title>
            <ISOAbbreviation>Curr. Opin. Struct. Biol.</ISOAbbreviation>
          </Journal>
          <ArticleTitle>TonB-dependent outer membrane transport: going for Baroque?</ArticleTitle>
          <Pagination>
            <MedlinePgn>394-400</MedlinePgn>
          </Pagination>
          <Abstract>
            <AbstractText>The import of essential organometallic micronutrients (such as iron-siderophores and vitamin B(12)) across the outer membrane of Gram-negative bacteria proceeds via TonB-dependent outer membrane transporters (TBDTs). The TBDT couples to the TonB protein, which is part of a multiprotein complex in the plasma (inner) membrane. Five crystal structures of TBDTs illustrate clearly the architecture of the protein in energy-independent substrate-free and substrate-bound states. In each of the TBDT structures, an N-terminal hatch (or plug or cork) domain occludes the lumen of a 22-stranded beta barrel. The manner by which substrate passes through the transporter (the "hatch-barrel problem") is currently unknown. Solution NMR and X-ray crystallographic structures of various TonB domains indicate a striking structural plasticity of this protein. Thermodynamic, biochemical and bacteriological studies of TonB and TBDTs indicate further that existing structures do not yet capture critical energy-dependent and in vivo conformations of the transport cycle. The reconciliation of structural and non-structural experimental data, and the unambiguous experimental elucidation of a detailed molecular mechanism of transport are current challenges for this field.</AbstractText>
          </Abstract>
          <Affiliation>Department of Molecular Physiology and Biological Physics, University of Virginia, PO Box 800736, Charlottesville, VA 22908-0736, USA. mwiener@virginia.edu</Affiliation>
          <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
              <LastName>Wiener</LastName>
              <ForeName>Michael C</ForeName>
              <Initials>MC</Initials>
            </Author>
          </AuthorList>
          <Language>eng</Language>
          <GrantList CompleteYN="Y">
            <Grant>
              <GrantID>DK 59999</GrantID>
              <Acronym>DK</Acronym>
              <Agency>NIDDK</Agency>
            </Grant>
          </GrantList>
          <PublicationTypeList>
            <PublicationType>Journal Article</PublicationType>
            <PublicationType>Research Support, N.I.H., Extramural</PublicationType>
            <PublicationType>Research Support, U.S. Gov't, P.H.S.</PublicationType>
            <PublicationType>Review</PublicationType>
          </PublicationTypeList>
        </Article>
        <MedlineJournalInfo>
          <Country>England</Country>
          <MedlineTA>Curr Opin Struct Biol</MedlineTA>
          <NlmUniqueID>9107784</NlmUniqueID>
        </MedlineJournalInfo>
        <ChemicalList>
          <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Bacterial Outer Membrane Proteins</NameOfSubstance>
          </Chemical>
          <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Bacterial Proteins</NameOfSubstance>
          </Chemical>
          <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Membrane Proteins</NameOfSubstance>
          </Chemical>
          <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>Multiprotein Complexes</NameOfSubstance>
          </Chemical>
          <Chemical>
            <RegistryNumber>0</RegistryNumber>
            <NameOfSubstance>tonB protein, Bacteria</NameOfSubstance>
          </Chemical>
        </ChemicalList>
        <CitationSubset>IM</CitationSubset>
        <MeshHeadingList>
          <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Bacterial Outer Membrane Proteins</DescriptorName>
            <QualifierName MajorTopicYN="N">chemistry</QualifierName>
            <QualifierName MajorTopicYN="N">metabolism</QualifierName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Bacterial Proteins</DescriptorName>
            <QualifierName MajorTopicYN="N">chemistry</QualifierName>
            <QualifierName MajorTopicYN="N">metabolism</QualifierName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="N">Biological Transport</DescriptorName>
            <QualifierName MajorTopicYN="N">physiology</QualifierName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="N">Crystallography, X-Ray</DescriptorName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Membrane Proteins</DescriptorName>
            <QualifierName MajorTopicYN="N">chemistry</QualifierName>
            <QualifierName MajorTopicYN="N">metabolism</QualifierName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="N">Models, Molecular</DescriptorName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="N">Multiprotein Complexes</DescriptorName>
          </MeshHeading>
          <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Protein Conformation</DescriptorName>
          </MeshHeading>
        </MeshHeadingList>
        <NumberOfReferences>38</NumberOfReferences>
      </MedlineCitation>
      <PubmedData>
        <History>
          <PubMedPubDate PubStatus="received">
            <Year>2005</Year>
            <Month>6</Month>
            <Day>7</Day>
          </PubMedPubDate>
          <PubMedPubDate PubStatus="revised">
            <Year>2005</Year>
            <Month>6</Month>
            <Day>18</Day>
          </PubMedPubDate>
          <PubMedPubDate PubStatus="accepted">
            <Year>2005</Year>
            <Month>7</Month>
            <Day>8</Day>
          </PubMedPubDate>
          <PubMedPubDate PubStatus="pubmed">
            <Year>2005</Year>
            <Month>7</Month>
            <Day>26</Day>
            <Hour>9</Hour>
            <Minute>0</Minute>
          </PubMedPubDate>
          <PubMedPubDate PubStatus="medline">
            <Year>2005</Year>
            <Month>12</Month>
            <Day>13</Day>
            <Hour>9</Hour>
            <Minute>0</Minute>
          </PubMedPubDate>
        </History>
        <PublicationStatus>ppublish</PublicationStatus>
        <ArticleIdList>
          <ArticleId IdType="pii">S0959-440X(05)00124-7</ArticleId>
          <ArticleId IdType="doi">10.1016/j.sbi.2005.07.001</ArticleId>
          <ArticleId IdType="pubmed">16039843</ArticleId>
        </ArticleIdList>
      </PubmedData>
    </PubmedArticle>
    I get the error: Only Comments, PIs and whitespace allowed at end of document [Ln: 40, Col: 1]
    Line 40 is the line where the second <PubmedArticle> element begins... Have I made a mistake?
      Do I have any mistake?
      I don't think you can have 2 root-nodes in an XML document.

      -David
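
To illustrate the single-root point: a well-formed XML document has exactly one root element, so several records need a wrapper around them. The sketch below assumes XML::Simple is installed; the wrapper name PubmedArticleSet, the trimmed records, and the variable names are assumptions made for the example, not taken from the poster's file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Two trimmed records under one wrapper root (wrapper name is an assumption).
my $xml = <<'XML';
<?xml version='1.0'?>
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1766380</PMID>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>16039843</PMID>
  </PubmedArticle>
</PubmedArticleSet>
XML

# PubmedArticle can repeat, so force it into an array ref as well.
my $set = XMLin($xml, ForceArray => ['PubmedArticle'], KeyAttr => []);

# Collect one PMID per record, in document order.
my @pmids = map { $_->{PMID} } @{ $set->{PubmedArticle} };

print "$_\n" for @pmids;
```

Listing PubmedArticle in ForceArray keeps the loop valid when the wrapper happens to contain only one record.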

        No, thanks, it was a silly typo... :) It works OK now, but the problem with the authors remains.
        In particular, the above code (which I created based on your help) works fine when I have more than one author. But if I have only one, it says: Not an ARRAY reference at read_xml.pl line 19 (line 19 is the foreach loop).
        So, I was wondering, is there a way of finding out if I have one or more authors in my xml entry? Thank you all for your time!
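
One way to answer that question without touching the parser options is to inspect the value with Perl's built-in ref function and normalize it to a list. This is only a sketch: as_list is a hypothetical helper name, and the two hand-built hashes merely imitate what XML::Simple returns for a one-author and a multi-author record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a list whether the value is undef,
# a single hash ref, or an array ref of hash refs.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Hand-built stand-ins for the parsed structures (not real parser output).
my $one  = { AuthorList => { Author => { LastName => 'Wiener', Initials => 'MC' } } };
my $many = { AuthorList => { Author => [
    { LastName => 'Miller',   Initials => 'SI' },
    { LastName => 'Tsirigos', Initials => 'KT' },
] } };

for my $article ($one, $many) {
    my @authors = as_list($article->{AuthorList}{Author});
    my $names   = join ', ', map { "$_->{LastName} $_->{Initials}" } @authors;
    print scalar(@authors), " author(s): $names\n";
}
```

The ForceArray option suggested elsewhere in the thread achieves the same effect at parse time; the ref check is mainly useful when the data structure comes from code you cannot change.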
Re: Perl & Simple::XML
by Cop (Initiate) on Oct 06, 2007 at 00:20 UTC

    My first thought was to use Data::Dumper so you can understand the parsed structure, but then I noticed that you already have it. Your issue is not with XML but with Perl data structures; go read about arrays.