bobn has asked for the wisdom of the Perl Monks concerning the following question:
So I started playing around with XML parsing (well, HTML, but it's well enough formed that I can use XML parsers on it). I ran into something on the Perl side of things I don't understand.
I get a nodeset, start walking through it and getting text out, but for each node I get the text contained in that node's element AND the text of all of its descendants (contained elements).
I'm getting this with XML::LibXML::XPathContext, but it happens with XML::XPath as well.
The event-driven parsers I've tried don't seem to have this issue: they think that text belongs to the innermost containing element, just like I do. lxml.etree in Python, their binding for libxml2, does not do this (though it definitely has oddities of its own; check out "tail text" sometime, it's a doozy!).
I'm going to stop now, 'coz I'm becoming increasingly sure I'm just missing something stupid.
Is it supposed to do this, and if so, how do I get at just the text for the outermost element of my node?
    #!/usr/bin/perl
    use XML::LibXML::XPathContext;

    our $contents = <<EOT;
    <html>
      <head>
        <title>Title_Text</title>
      </head>
      <body>
        <p>paragraph_text</p>
        <div>
          <div>
            innnermost_text
          </div>
        </div>
      </body>
    </html>
    EOT

    open my $fh, '>', './x.html';
    print $fh $contents;
    close $fh;

    my $init_node = XML::LibXML->new->parse_file('./x.html');
    my $xp = XML::LibXML::XPathContext->new($init_node);

    my $i = 0;
    my $nodeset = $xp->findnodes('//*');
    for my $node ($nodeset->get_nodelist) {
        my $elname = $node->getName();
        print qq[<$elname> node - $i\n];
        my $text = '';
        $text = $node->string_value();  # this brings in text of
                                        # *all* descendant nodes
        $text =~ s/(\s)+/$1/msg;
        print 'Text = ', $text, "\n";
        $i++;
    }

Produces:

    <html> node - 0
    Text = Title_Text paragraph_text innnermost_text
    <head> node - 1
    Text = Title_Text
    <title> node - 2
    Text = Title_Text
    <body> node - 3
    Text = paragraph_text innnermost_text
    <p> node - 4
    Text = paragraph_text
    <div> node - 5
    Text = innnermost_text
    <div> node - 6
    Text = innnermost_text
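For reference, the output above is consistent with XPath 1.0, which defines the string-value of an element node as the concatenation of all of its descendant text nodes in document order. A minimal sketch of one way to get only an element's own text is to select its direct text-node children with `./text()`. The helper name `own_text` and the extra `outer_text` in the sample document are assumptions added for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample input (illustrative): the outer <div> has text of its own
# plus a nested <div>, so the difference is visible.
my $xml = <<'EOT';
<html>
  <body>
    <p>paragraph_text</p>
    <div>outer_text
      <div>inner_text</div>
    </div>
  </body>
</html>
EOT

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical helper: join only the text nodes that are direct
# children of $node, then normalize whitespace.
sub own_text {
    my ($node) = @_;
    my $text = join '', map { $_->nodeValue } $node->findnodes('./text()');
    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

for my $node ($doc->findnodes('//*')) {
    printf "<%s> own text = '%s'\n", $node->nodeName, own_text($node);
}
```

With this input, `<html>` and `<body>` should report empty text, while `<p>` and each `<div>` report only the text they directly contain. Whether whitespace should be collapsed like this depends on the document; that part is a design choice, not something the library requires.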
--Bob Niederman,
All code given here is UNTESTED unless otherwise stated.
Replies are listed 'Best First'.
Re: XML::LibXML::XPathContext->string_value - should ALL of the descendant's text be there?
by haukex (Archbishop) on Aug 08, 2020 at 07:25 UTC
by bobn (Chaplain) on Aug 09, 2020 at 00:11 UTC
by haukex (Archbishop) on Aug 09, 2020 at 08:27 UTC
by bobn (Chaplain) on Aug 10, 2020 at 06:01 UTC