in reply to Re^2: XML::LibXML::Reader giving wrong matched element
in thread XML::LibXML::Reader giving wrong matched element

What is it that distinguishes the <state...> and <city...> tags from the <country...> tags? Is it strictly that the OP's code provides the shortcut close, "/>" for state and city but not for country? If <country...> had a shortcut close would it not need a </country> tag? And if so, why not use a shortcut close globally -- that is, on <world> and <country>. I still don't "get it" in that regard.

It seems to me that consistency would make parsing easier... and might even help explain why the OP (you?) is seeing unexpected numbers of elements.

I'm quite curious, because a simpleminded search on "XML close tag" produced a selection of inconsistent assertions.

On the first paw, your explanation doesn't seem consistent with beginner tuts like that at http://www.w3schools.com/xml/xml_syntax.asp nor with http://www.w3schools.com/xml/xml_dtd.asp nor http://www.xmlfiles.com/xml/xml_syntax.asp -- none of which are authoritative (but I'm too full of turkey to chase it down -- and while you may suspect a turkey byproduct, that's another discussion). All of those agree that the only or chief exception to a "must have a closing tag" rule is the <empty-element />

But on the hind paw, the XML validator at http://www.w3schools.com/xml/xml_validator.asp passes, as "well formed," the OP's code, when that is modified with a leading <?xml version...> header, and has the ellipsis replaced with arbitrary sample data.

Thus, while I'm still uncertain "why" and "how" your take on the matter can be true, I won't dispute it (at least for the moment).

I will, however, quibble with your assertions about html. They're good examples of the point you're making... but they're NOT entirely correct. The standards for 4.01 transitional and 4.01 strict differ on what's required, where. Your link example is correct ("valid") in both; the shortcut close on image is NOT required by 4.01 transitional (aka "loose"). And html5 is a fish with different feathers.

In any case, if you posted the OP as an AM and are now expanding on that post, please provide the sample output requested above... and, whether you are the OP or not, thank you for taking the time and effort to reply.

Re^4: XML::LibXML::Reader giving wrong matched element
by ikegami (Patriarch) on Nov 25, 2011 at 02:11 UTC

    What is it that distinguishes the <state...> and <city...> tags from the <country...> tags? Is it strictly that the OP's code provides the shortcut close, "/>" for state and city but not for country?

    Yes.

    If <country...> had a shortcut close would it not need a </country> tag?

    <foo x="y"/> and <foo x="y"></foo> are completely equivalent, so not only would it not need a </country> tag, it could not have a </country> tag. One can't close an element more than once.
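A minimal sketch of that equivalence, not taken from the thread: it uses Python's standard `xml.etree.ElementTree` as a stand-in for XML::LibXML, and the helper name `describe` is made up for illustration. It parses both spellings, compares what the parser reports, and then tries to close the element a second time.

```python
import xml.etree.ElementTree as ET

def describe(xml_text):
    """Return (tag, attributes, text, child count) for the root element."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, dict(elem.attrib), elem.text, len(elem))

# The shortcut close and the explicit open/close pair parse to the same thing.
short_form = describe('<foo x="y"/>')
long_form = describe('<foo x="y"></foo>')
same = short_form == long_form
print(same)

# Closing an already closed element is a syntax error, not a no-op.
try:
    ET.fromstring('<foo x="y"/></foo>')
    double_close_ok = True
except ET.ParseError as err:
    double_close_ok = False
    print("rejected:", err)
```

The comparison is only as deep as the four fields the hypothetical helper looks at, but for an element with no content those are all there is to compare.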

    your explanation doesn't seem consistent with beginner tuts

    I presume you are referring to "all XML elements must have a closing tag".

    That claim is true, but <foo/> serves as both the opening and closing tag of the element, so it satisfies the requirement of the presence of a closing tag.

    why not use a shortcut close globally -- that is, on <world> and <country>

    That would be impossible because the world and country elements have non-attribute children.

    In fact, I'd say the city elements are misplaced in the OP's XML. The indenting indicates the OP wants them to be children of states, but he made them children of countries.

    <country short="usa" name="united state of america">
        <state short="CA" name="california"/>
            <city short="SFO" name="San Franscisco"/>
            <city short="EM" name="Emeryville"/>
        <state short="FL" name="florida"/>
        ... More intermixed states and cities ...
    </country>

    means

    <country short="usa" name="united state of america">
        <state short="CA" name="california"></state>
        <city short="SFO" name="San Franscisco"></city>
        <city short="EM" name="Emeryville"></city>
        <state short="FL" name="florida"></state>
        ... More intermixed states and cities ...
    </country>

    but he surely wants

    <country short="usa" name="united state of america">
        <state short="CA" name="california">
            <city short="SFO" name="San Franscisco"/>
            <city short="EM" name="Emeryville"/>
        </state>
        <state short="FL" name="florida">
            ... More cities ...
        </state>
        ... More states ...
    </country>
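To see the difference between the two structures from a parser's point of view, here is a rough sketch. It is not the OP's code and does not use XML::LibXML::Reader; it uses Python's standard `xml.etree.ElementTree`, the sample data is trimmed to the elements actually shown (the elided parts are left out, not guessed), and `parent_of_each_city` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the XML as written: cities are siblings of states.
AS_WRITTEN = """
<country short="usa" name="united state of america">
    <state short="CA" name="california"/>
        <city short="SFO" name="San Franscisco"/>
        <city short="EM" name="Emeryville"/>
    <state short="FL" name="florida"/>
</country>
"""

# Trimmed version of the XML as presumably intended: cities nested in a state.
AS_INTENDED = """
<country short="usa" name="united state of america">
    <state short="CA" name="california">
        <city short="SFO" name="San Franscisco"/>
        <city short="EM" name="Emeryville"/>
    </state>
    <state short="FL" name="florida"/>
</country>
"""

def parent_of_each_city(xml_text):
    """Map each city's short name to the tag of its parent element."""
    country = ET.fromstring(xml_text.strip())
    parents = {}
    for parent in country.iter():      # the root and every descendant
        for child in parent:           # direct children only
            if child.tag == "city":
                parents[child.get("short")] = parent.tag
    return parents

written = parent_of_each_city(AS_WRITTEN)
intended = parent_of_each_city(AS_INTENDED)
print(written)   # {'SFO': 'country', 'EM': 'country'}
print(intended)  # {'SFO': 'state', 'EM': 'state'}
```

Indentation is just whitespace to the parser, so only the placement of the closing tags decides which element a city belongs to.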

    the shortcut close on image is NOT required by 4.01 transitional (aka "loose").

    That's not right.

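    Columns: "HTML4" is SGML HTML4, strict and transitional alike; "Other SGML" is any other SGML application; "HTML5 (HTML)" is the HTML serialisation of HTML5; "XML" covers XHTML1 strict, XHTML1 transitional, HTML5 and any other XML schema.

    Markup      | HTML4                   | Other SGML | HTML5 (HTML) | XML
    ------------+-------------------------+------------+--------------+----------------------
    <br>        | Well-formed and Valid   | [Varies]   | Valid        | Malformed
    <p>         | Well-formed and Valid   | [Varies]   | Valid        | Malformed
    <div>       | Well-formed but Invalid | [Varies]   | Invalid      | Malformed
    <br/>       | Malformed               | [Varies]   | Tolerated*   | Well-formed and Valid
    <p/>        | Malformed               | [Varies]   | Invalid      | Well-formed and Valid
    <div/>      | Malformed               | [Varies]   | Invalid      | Well-formed and Valid
    <br></br>   | Well-formed but Invalid | [Varies]   | Invalid      | Well-formed and Valid
    <p></p>     | Well-formed and Valid   | [Varies]   | Valid        | Well-formed and Valid
    <div></div> | Well-formed and Valid   | [Varies]   | Valid        | Well-formed and Valid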

    Note that browsers are very forgiving and accept all kinds of malformed and invalid HTML.

    As an aside, the table clearly highlights XHTML's advantage over HTML: simplicity. The cost, of course, is that XHTML is more wordy. (Like Java vs Perl?)

    * — The HTML serialisation of HTML5 accepts "/" on elements that cannot have a closing tag (area, base, br, col, command, embed, hr, img, input, keygen, link, meta, param, source, track, wbr). (ref)

    The standards for 4.01 transitional and 4.01 strict differ on what's required

    They differ on what constitutes a valid HTML or XHTML document (i.e. what elements and attributes are allowed), but they do not differ on what constitutes a well-formed HTML or XML document (i.e. on what is valid syntax).
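The well-formedness half of that distinction can be checked mechanically for the XML column of the table. What follows is only a sketch using Python's standard `xml.etree.ElementTree`, which is a non-validating parser: it can tell malformed from well-formed, but it says nothing about validity against a DTD or schema. The function name `is_well_formed_xml` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(fragment):
    """True if the fragment parses as a complete XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# The nine markup forms from the table's first column.
ROWS = ["<br>", "<p>", "<div>",
        "<br/>", "<p/>", "<div/>",
        "<br></br>", "<p></p>", "<div></div>"]

results = {markup: is_well_formed_xml(markup) for markup in ROWS}
for markup, ok in results.items():
    print(f"{markup:12} {'well-formed' if ok else 'malformed'}")
```

The element name plays no part in the outcome: an unclosed tag is malformed XML whether it is br, p or div, which is the sense in which syntax does not depend on the DTD.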

      ikegami
      Thanks for the clarifications re XML; I think I have a general idea of the meaning of your "non-attribute children" (but shall have to look further, to be sure). But the rest is crystal clear. Again, thank you for putting so much information into your reply.

      But, I wonder if I was unclear about the "shortcut close" ( ".../>") for <img src="foo.jpg" alt=... > as your table does not illustrate it. My assertion that 'the shortcut close on image is NOT required by 4.01 transitional (aka "loose")' is supported by the likes of Dave Raggett (at http://www.w3.org/MarkUp/Guide/ for example) and -- more important -- in the "HTML 4.01 Specification, W3C Recommendation 24 December 1999" (at http://www.w3.org/TR/REC-html40/) which links to an illustration of the use of <img> at http://www.w3.org/TR/REC-html40/struct/objects.html.

      Granted, these are both decade-old documents, but I find nothing to countenance the shortcut close under 4.01 transitional nor any indication of any substantive difference on this point between the proposal cited and current standards -- for html 4.01 transitional.

      Update: In fact, what seems to me conclusive is the statement in the very latest 4.01 spec (at http://www.w3.org/TR/1999/REC-html401-19991224/struct/objects.html) re the <img ...> tag:

      Start tag: required, End tag: forbidden

      the emphasis is in the original.

      Usually, when I make such a statement in disagreement with something you've said, it merely proves that I've missed something crucial. Is that the case here, and if so, would you be so good as to point me (and future readers) to it?

        I didn't use IMG because it has required attributes, and I didn't want that to become an issue. In other aspects, IMG is like BR. Refer to the rows for BR.

        My assertion that 'the shortcut close on image is NOT required by 4.01 transitional (aka "loose")' is supported by the likes of Dave Raggett (at http://www.w3.org/MarkUp/Guide/ for example)

        Saying "not required by 4.01 transitional" implies "allowed by 4.01 transitional", and that's not the case. It's not allowed in HTML. The linked document is completely silent on the subject.

        And again, whether it's transitional or strict makes no difference whatsoever here, since they don't affect syntax.

        In fact, what seems to me conclusive is the statement in the very latest 4.01 spec (at http://www.w3.org/TR/1999/REC-html401-19991224/struct/objects.html) re the <img ...> tag: Start tag: required, End tag: forbidden

        Looking at the definition of an element is irrelevant because <foo/> is never well-formed HTML.