http://qs1969.pair.com?node_id=923708

djlerman has asked for the wisdom of the Perl Monks concerning the following question:

I need to remove an HTML tag together with everything between its opening and closing tags. I thought the best way would be to use HTML::TokeParser or a regex. With HTML::TokeParser, I can't figure out how to remove a specific tag as well as its content.

With a regex, I can't get the search and replace to work. Example follows...

$content ="<div id="content"> BLa Bla Bla <div id='print'> *** TEXT TO BE REMOVED *** *** CODE TO BE REMOVED *** *** FORMATTING TO BE REMOVED *** </div> bla bla bla bla </div>";
$content =~ s/<div id='print'>(.*?)<\/div>//gis;
print $content;

Re: Removing HTML beginning and ending tag with everything in between?
by ikegami (Patriarch) on Sep 01, 2011 at 20:23 UTC

    For that very specific text, your substitution does work. (There's a syntax error when building your string, though: you use " as the delimiter and didn't escape the " characters within the string.)
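
    A minimal sketch of the same example with the quoting problem fixed, assuming the HTML is exactly the sample string from the question. The q{} delimiter is one choice among several; escaping the inner " characters or using a heredoc would work as well.

```perl
use strict;
use warnings;

# q{...} is a single-quoted string with brace delimiters, so neither
# the " nor the ' characters inside the HTML need escaping.
my $content = q{<div id="content"> BLa Bla Bla <div id='print'> *** TEXT TO BE REMOVED *** *** CODE TO BE REMOVED *** *** FORMATTING TO BE REMOVED *** </div> bla bla bla bla </div>};

# Non-greedy match from the opening tag to the first </div> after it.
# /s lets . match newlines, /i ignores case, /g removes every match.
$content =~ s/<div id='print'>.*?<\/div>//gis;

print "$content\n";
```

    Note that the non-greedy match stops at the first </div> it meets, so this only works while the div being removed contains no nested div of its own.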

    Using XML::LibXML (which has an HTML parser), it would be:

    for my $node ($root->findnodes('//div[@id="print"]')) {
        $node->parentNode()->removeChild($node);
    }
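
    For context, a self-contained sketch of how that loop might be wired up, assuming XML::LibXML is installed and the input is a shortened version of the sample string from the question. The variable names $html, $doc and $out are illustrative, and the loop here runs on the parsed document rather than on a separate $root.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened sample input (an assumption for this sketch).
my $html = q{<div id="content"> BLa Bla Bla <div id='print'> *** TEXT TO BE REMOVED *** </div> bla bla bla bla </div>};

# recover => 1 asks libxml2 to tolerate imperfect real-world HTML.
my $doc = XML::LibXML->load_html(string => $html, recover => 1);

# The XPath tests the attribute's value, so it does not matter
# whether the source wrote id='print' or id="print".
for my $node ($doc->findnodes('//div[@id="print"]')) {
    $node->parentNode->removeChild($node);
}

# Serialize the modified tree back to HTML.
my $out = $doc->toStringHTML;
print $out;
```

    Because the parser builds a full document, the serialized result is wrapped in html and body elements rather than being the bare fragment. Unlike the regex, this approach removes the whole element even when it contains nested div elements.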
Re: Removing HTML beginning and ending tag with everything in between?
by Kc12349 (Monk) on Sep 01, 2011 at 20:52 UTC

    Take a look at your single quotes versus double quotes: you have double quotes around "content" and single quotes around 'print'. Other than that, the output I get from your regex code is below. Is this not what you are looking for?

    <div id="content"> BLa Bla Bla bla bla bla bla </div>