Consider this piece of HTML: <p>Hello, <b>World</b>!</p> - the <p> element has three child nodes: a text node ("Hello, "), the <b> element, and another text node ("!"). But if you ask the question "what text is contained in that paragraph?" then wouldn't the natural answer be "Hello, World!" instead of "Hello, !"?
Actually, I *do* believe that the text "World" belongs to the b element and, therefore, not to the p element. Printing out specifically the p element's text should not include it. Sure as hell, printing the text for <html> should not print out all the text of all the descendant nodes.
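
To make the two readings concrete, here is a minimal sketch using Python's standard xml.dom.minidom (not XML::LibXML). The helper names own_text and all_text are made up for illustration: the first collects only the text nodes that are direct children of an element, the second concatenates the text of every descendant.

```python
from xml.dom import minidom

doc = minidom.parseString("<p>Hello, <b>World</b>!</p>")
p = doc.documentElement

# The three child nodes of <p>: a text node, the <b> element, a text node.
kinds = [child.nodeName for child in p.childNodes]


def own_text(node):
    """Hypothetical helper: text from the node's direct text-node children only."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def all_text(node):
    """Hypothetical helper: text of the node and all descendants, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(all_text(c) for c in node.childNodes)


print(kinds)        # ['#text', 'b', '#text']
print(own_text(p))  # Hello, !
print(all_text(p))  # Hello, World!
```

The first result is the "text belongs to the innermost element" reading; the second is the "everything underneath" reading that the question is about.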

I've tried 3 other pieces of code: HTML::Parser and, in Python, lxml.etree (bindings to libxml2, as is XML::LibXML) and xml.parsers.expat (comparable to HTML::Parser). They all agree that text belongs to the innermost containing element, and no other. (Well, except lxml.etree, which thinks that elements mixed in with the text of a parent element somehow suck up the text after them in something known as "tail text". I never heard of it before and it's really hard to find anything about it on the internet that *isn't* associated with lxml and Python. I think they just made that crap up.) So that's where I am: 3 other pieces of software disagree with this one, and I can't see that I've done anything incorrectly.
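
For reference, a small sketch of the "tail text" behaviour, using the standard library's xml.etree.ElementTree rather than lxml itself; lxml.etree follows the same text/tail model, so the picture should carry over. The own_text helper is a made-up name for illustration, not part of either library.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello, <b>World</b>!</p>")
b = p.find("b")

print(repr(p.text))  # 'Hello, '  text before the first child
print(repr(b.text))  # 'World'
print(repr(b.tail))  # '!'        text after </b>, stored on <b> as its tail


def own_text(elem):
    """Hypothetical helper: rebuild the text that sits directly inside elem."""
    # elem.text is the text before the first child; each child's tail is
    # the text that follows that child inside elem.
    return (elem.text or "") + "".join(child.tail or "" for child in elem)


print(own_text(p))            # Hello, !
print("".join(p.itertext()))  # Hello, World!
```

So in this model the "!" is stored on the b element as its tail, even though it sits inside the p element in the document; own_text stitches the p element's direct text back together, and itertext() gives the all-descendants version.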

--Bob Niederman, http://bob-n.com

All code given here is UNTESTED unless otherwise stated.


In reply to Re^2: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by bobn
in thread XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there? by bobn
