Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here is my XML which I am parsing using XML::LibXML::Reader
<world>
  <country short="usa" name="united state of america">
    <state short="CA" name="california"/>
    <city short="SFO" name="San Franscisco"/>
    <city short="EM" name="Emeryville"/>
    <state short="FL" name="florida"/>
    <city .../>
    .
    <city ../>
  </country>
  <country short="abc" name="a for apple">
    <state ..../>
  </country>
</world>
and here is the code
use XML::LibXML::Reader;

my $reader  = XML::LibXML::Reader->new(location => "map.xml");
my $pattern = XML::LibXML::Pattern->new('/world');
my @matchedNodes;
while ($reader->nextPatternMatch($pattern)) {
    push @matchedNodes, $reader->copyCurrentNode(1);
}

@matchedNodes gives me two elements. Why? There is only one world tag. What is wrong with my code?

Similarly, when I use the pattern my $pattern = XML::LibXML::Pattern->new('/world/country');

it gives me four elements, whereas I have only two country tags.

Please explain where I am going wrong. I need to use Pattern (for XPath) and I cannot avoid it. Also, I would like to stick with XML::LibXML::Reader for compatibility reasons.

Please help.

Re: XML::LibXML::Reader giving wrong matched element
by choroba (Cardinal) on Nov 24, 2011 at 14:45 UTC
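    XML::LibXML::Reader is a pull parser. It lets you do something on the closing tag, too, so the pattern matches once on the opening tag and once more on the closing tag. If you are not interested in closing tags, just add
    unless $reader->nodeType == XML_READER_TYPE_END_ELEMENT
    after the push, as a statement modifier before its semicolon.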
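
Put together, the suggestion above might look like the following sketch. It is not the original poster's final code: it reads from an inline string (the hypothetical $xml, standing in for map.xml) so that it runs on its own, it loads XML::LibXML::Pattern explicitly to be safe, and it relies on XML::LibXML::Reader exporting the XML_READER_TYPE_* constants by default.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants
use XML::LibXML::Pattern;

# Hypothetical sample data standing in for map.xml.
my $xml = <<'XML';
<world>
  <country short="usa" name="united state of america">
    <state short="CA" name="california"/>
    <city short="SFO" name="San Franscisco"/>
  </country>
  <country short="abc" name="a for apple">
    <state short="X" name="x"/>
  </country>
</world>
XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/world/country');

my @matchedNodes;
while ($reader->nextPatternMatch($pattern)) {
    # The pattern also matches when the reader sits on </country>,
    # so skip end-element events to keep one copy per element.
    push @matchedNodes, $reader->copyCurrentNode(1)
        unless $reader->nodeType == XML_READER_TYPE_END_ELEMENT;
}

print scalar(@matchedNodes), " country element(s) matched\n";
```

Empty elements such as <state .../> produce no separate closing-tag event, so the guard only matters for elements written with an explicit closing tag.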
Re: XML::LibXML::Reader giving wrong matched element
by ww (Archbishop) on Nov 24, 2011 at 14:59 UTC
    You're going to have to help us (well, /me, anyway) to understand what you mean by "@matchedNodes give me two elements" -- and that's another way for me to say, "please tell us your output (and error messages, if any)."

    And then too, though I don't know for sure, I think your XML is NOT valid. Don't you need </state> and </city> tags for each state and city entry?

    Rephrased: for clarity, the paraphrase of the quote in the first para.

      Don't you need </state> and </city> tags for each state and city entry?

      Actually, no you don't. These are self-closing tags (can't remember the technical term). You use them for tags that hold no other tags or values. I'm not sure how to put that into plain English. Let's try an example.

      Let's use some HTML tags for this XML example for simplicity. The classic open/close tag would be a link:

      <a href="/hello">Inner value</a>
      And then there is the classic image tag.
      <img src="monk.gif"/>
      The slash at the end closes the tag right where it opens.

      I'm not an expert in this, so i just hope i didn't mess this up. Because if i did, i have to rewrite like 50 XML files tomorrow....
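
One way to check the claim above is to let the parser decide. The sketch below, which assumes XML::LibXML is installed and uses made-up one-element documents, parses the short and the long form of the same empty element and then tries a tag that is never closed.

```perl
use strict;
use warnings;
use XML::LibXML;

# The same empty element written with the shortcut close and with an
# explicit closing tag; both documents are well-formed XML.
my $short = XML::LibXML->load_xml(string => '<img src="monk.gif"/>');
my $long  = XML::LibXML->load_xml(string => '<img src="monk.gif"></img>');

for my $doc ($short, $long) {
    my $root = $doc->documentElement;
    printf "%s src=%s children=%s\n",
        $root->nodeName,
        $root->getAttribute('src'),
        $root->hasChildNodes ? 'yes' : 'no';
}

# A tag that is never closed is not well-formed, so parsing dies.
my $unclosed_ok = eval {
    XML::LibXML->load_xml(string => '<img src="monk.gif">');
    1;
};
print "unclosed tag parsed: ", ($unclosed_ok ? 'yes' : 'no'), "\n";
```

Any element can be written in the short form as long as it has no content, which is why it cannot be used on <world> or <country> here: they contain child elements.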

      Don't use '#ff0000':
      use Acme::AutoColor; my $redcolor = RED();
      All colors subject to change without notice.
        What is it that distinguishes the <state...> and <city...> tags from the <country...> tags? Is it strictly that the OP's code provides the shortcut close, "/>" for state and city but not for country? If <country...> had a shortcut close would it not need a </country> tag? And if so, why not use a shortcut close globally -- that is, on <world> and <country>. I still don't "get it" in that regard.

        It seems to me that consistency would make parsing easier... and might even help explain why the OP (you?) is seeing unexpected numbers of elements.

        I'm quite curious, because a simpleminded search on "XML close tag" produced a selection of inconsistent assertions.

        On the first paw, your explanation doesn't seem consistent with beginner tuts like that at http://www.w3schools.com/xml/xml_syntax.asp nor with http://www.w3schools.com/xml/xml_dtd.asp nor http://www.xmlfiles.com/xml/xml_syntax.asp -- none of which are authoritative (but I'm too full of turkey to chase it down -- and while you may suspect a turkey byproduct, that's another discussion). All of those agree that the only or chief exception to a "must have a closing tag" rule is the <empty-element />

        But on the hind paw, the XML validator at http://www.w3schools.com/xml/xml_validator.asp passes, as "well formed," the OP's code, when that is modified with a leading <?xml version...> header, and has the ellipsis replaced with arbitrary sample data.

        Thus, while I'm still uncertain "why" and "how" your take on the matter can be true, I won't dispute it (at least for the moment).

        I will, however, quibble with your assertions about html. They're good examples of the point you're making... but they're NOT entirely correct. The standards for 4.01 transitional and 4.01 strict differ on what's required, where. Your link example is correct ("valid") in both; the shortcut close on image is NOT required by 4.01 transitional (aka "loose"). And html5 is a fish with different feathers.

        In any case, if you posted the OP as an AM and are now expanding on that post, please provide the sample output requested above... and, whether you are the OP or not, thank you for taking the time and effort to reply.

Re: XML::LibXML::Reader giving wrong matched element
by locked_user sundialsvc4 (Abbot) on Nov 24, 2011 at 14:20 UTC

    It would be useful to us if you could show us the other attributes of the various tags that you’re getting as results.   “You get four elements...”   Show us what they are ... let’s say with Data::Dumper ...
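
A rough sketch of the kind of dump being asked for, assuming a cut-down inline document (the hypothetical $xml) in place of map.xml: for each pattern match it records the node name, node type, and depth reported by the reader, then prints them with Data::Dumper.

```perl
use strict;
use warnings;
use Data::Dumper;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants
use XML::LibXML::Pattern;

# Hypothetical cut-down version of the document from the question.
my $xml = '<world><country short="usa"/><country short="abc"/></world>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/world');

# Record what the reader is positioned on at every pattern match.
my @events;
while ($reader->nextPatternMatch($pattern)) {
    push @events, {
        name  => $reader->name,
        type  => $reader->nodeType,
        depth => $reader->depth,
    };
}

$Data::Dumper::Sortkeys = 1;
print Dumper(\@events);
```

Comparing the type values against XML_READER_TYPE_ELEMENT and XML_READER_TYPE_END_ELEMENT shows whether a match came from an opening or a closing tag.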