Thanks for the previous.

I have a question about HTML::TreeBuilder::XPath and HTML::Element, and the interaction between them. I would like to manipulate the content of an element while leaving all its children in place. I have not found a way to do that, because replace_with() appears to escape the < and > signs automatically and unavoidably. The example below uses ~literal, but I have also tried creating a new element. Either way, the child elements within the selected element end up escaped despite my best efforts. How could I do something like the following (using a different workflow if necessary) so that the tags of the child elements remain intact and unescaped?

#!/usr/bin/perl

use HTML::TreeBuilder::XPath;
use HTML::Element;

use warnings;
use strict;

my $xhtml = HTML::TreeBuilder::XPath->new;
$xhtml->implicit_tags(1);
$xhtml->no_space_compacting(1);

$xhtml->parse_file(\*DATA)
    or die("Could not parse file handle for 'DATA' : $!\n");

for my $item ($xhtml->findnodes('//div/ul/li')) {
    my $li = $item->as_XML;
    $li =~ s/^\s+//;

    # ... omitting rest of the stuff which happens to $li ...

    my $new = HTML::Element->new('~literal', 'text' => $li);
    $item->replace_with($new);
}

print $xhtml->as_XML_indented;

$xhtml->delete;

exit(0);

__DATA__
<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul>
        <li>
          foo foo foo foo <em>bar</em> foo foo foo foo foo
        </li></ul></div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul>
        <li>
          foo foo foo foo <em>bar</em> foo foo foo foo foo
          <ul>
            <li>alpha</li>
            <li>b<em>et</em>a</li>
            <li>gamma</li>
          </ul>
        </li></ul></div>
  </body>
</html>

The output I get is as follows:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul>&lt;li&gt; foo foo foo foo &lt;em&gt;bar&lt;/em&gt; foo foo foo foo foo &lt;/li&gt;
      </ul>
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul>&lt;li&gt; foo foo foo foo &lt;em&gt;bar&lt;/em&gt; foo foo foo foo foo &lt;ul&gt;&lt;li&gt;alpha&lt;/li&gt;&lt;li&gt;b&lt;em&gt;et&lt;/em&gt;a&lt;/li&gt;&lt;li&gt;gamma&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;
      </ul>
    </div>
  </body>
</html>

The output I would like to get instead would look like this:

<html>
  <head>
    <title>Foo Bar</title>
  </head>
  <body>
    <div><a href=" http://foo.example.com/ ">Foo Bar</a>
      <ul><li>foo foo foo foo <em>bar</em> foo foo foo foo foo </li>
      </ul>
    </div>
    <div><a href=" http://bar.example.com/ ">Bar Foo</a>
      <ul><li>foo foo foo foo <em>bar</em> foo foo foo foo foo <ul><li>alpha</li><li>b<em>et</em>a</li><li>gamma</li></ul></li>
      </ul>
    </div>
  </body>
</html>

I'm not sure whether HTML::TreeBuilder::XPath can be made to work like that. If it can, what has to change?
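One possible different workflow, offered only as a minimal sketch: rather than serializing each li with as_XML and putting the string back, edit the text children of the element in place, so the child elements are never turned into a string and never need re-parsing. The sketch uses content_refs_list from HTML::Element, which returns a reference to each child so that plain-text children can be modified directly. The helper name trim_li_text and the shortened sample markup are made up for illustration, and only the leading-whitespace step from the example above is covered; whatever happens to $li in the omitted part might need a different treatment.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of any module): parses $html, strips the
# leading whitespace of each matched <li> in place, and returns the
# re-serialized document.
sub trim_li_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->implicit_tags(1);
    $tree->no_space_compacting(1);
    $tree->parse($html);
    $tree->eof;

    for my $item ($tree->findnodes('//div/ul/li')) {
        # Each entry is a reference to one child: either a plain text
        # string or an HTML::Element object.
        my ($first) = $item->content_refs_list;
        next unless defined $first;   # element has no content
        next if ref $$first;          # first child is an element, not text

        # Edit the text segment directly; <em> and nested lists stay
        # untouched because they are never serialized.
        $$first =~ s/^\s+//;
    }

    my $out = $tree->as_XML_indented;
    $tree->delete;
    return $out;
}

# Shortened sample input, assumed for illustration only.
my $sample = <<'END_HTML';
<html><head><title>Foo Bar</title></head>
<body>
<div><ul><li>
   foo foo <em>bar</em> foo foo
   <ul><li>alpha</li><li>b<em>et</em>a</li></ul>
</li></ul></div>
</body></html>
END_HTML

print trim_li_text($sample);
```

Since the tree only ever holds text segments and element objects here, there is no markup string for the serializer to escape. If the omitted processing really has to work on the serialized string, this approach would not be enough on its own.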


In reply to Avoiding escaped child elements with HTML::TreeBuilder::XPath or HTML::Element by mldvx4
