Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

Here's a minimal example: say I have two paragraphs:

<p> Paragraph one here. </p> <p> Paragraph <b>two</b> here. </p>

And I use Mojo::DOM to grab their text:

use Mojo::DOM;
my $dom = Mojo::DOM->new('<p>Paragraph one here.</p><p>Paragraph <b>two</b> here.');
for my $e ( $dom->find('p')->each ) {
    print $e->text,$/;
}
### Output:
# Paragraph one here.
# Paragraph here.
#

How do I access that paragraph's complete text, including the text inside that second level of markup? And is this a bug or a feature?

Re: Mojo::DOM doesn't include marked-up text in an element's text
by choroba (Cardinal) on Apr 22, 2020 at 22:51 UTC
    It's a feature. XML::LibXML behaves similarly:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw{ say };

    use XML::LibXML;

    my $xml = '<r><p>Paragraph one here.</p><p>Paragraph <b>two</b> here.</p></r>';
    my $dom = 'XML::LibXML'->load_xml(string => $xml);

    print $dom->findvalue('/r/p[2]');  # Same as $dom->findnodes('/r/p[2]//text()')
    # Paragraph two here.

    print $dom->findnodes('/r/p[2]');  # Same as map $_->toString, $dom->findnodes('/r/p[2]')
    # <p>Paragraph <b>two</b> here.</p>

    print $dom->findnodes('/r/p[2]/text()');
    # Paragraph here

    What do you mean by "complete text"?


      "What do you mean by "complete text"?"

      As the Anonymous Monk says below, the method Cody Fendant should have used to get the combined text of all descendant nodes is all_text rather than text, so

      print $e->text,$/;

      becomes

      print $e->all_text,$/;

      Very handy, even in one-liners using ojo.

Re: Mojo::DOM doesn't include marked-up text in an element's text (all_text)
by Anonymous Monk on Apr 23, 2020 at 02:06 UTC

    Always check assumptions against docs

    From the Mojo::DOM docs:

    all_text

      my $text = $dom->all_text;

    Extract text content from all descendant nodes of this element.

    text

      my $text = $dom->text;

    Extract text content from this element only (not including child elements).
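For comparison, here is a minimal sketch that runs both methods over the markup from the question. It assumes Mojolicious is installed; the variable names @own_text and @all_text are only illustrative, and the closing </p> has been added to the second paragraph. The exact whitespace returned by text may differ between Mojolicious versions, so only the all_text result is shown.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(
    '<p>Paragraph one here.</p><p>Paragraph <b>two</b> here.</p>'
);

# text: only the text nodes that are direct children of each <p>,
# so the content of the nested <b> is left out
my @own_text = $dom->find('p')->map('text')->each;

# all_text: text from every descendant node, so "two" is included
my @all_text = $dom->find('p')->map('all_text')->each;

print "$_\n" for @all_text;
# Paragraph one here.
# Paragraph two here.
```

Passing a method name to map on the collection is just a shorter way of writing the loop from the question; calling $e->all_text inside the original for loop works equally well.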

      Thanks! And damn, I can't believe I didn't spot that in the documentation.

      To be fair, a simple "see all_text" in the documentation next to text would have saved me a lot of frustration!

        That's just one of the reasons why I miss AnnoCPAN so much.