Removing HTML beginning and ending tag with everything in between?

djlerman has asked for the wisdom of the Perl Monks concerning the following question:

I need to remove an HTML tag together with everything between its opening and closing tags. I thought the best way was to use HTML::TokeParser or a regex. With HTML::TokeParser, I can't figure out how to remove a specific tag as well as its content.

In REGEX I can't get search and replace to work. Example follows...

$content ="<div id="content">
               BLa Bla Bla
               <div id='print'>
                     *** TEXT TO BE REMOVED *** 
                     *** CODE TO BE REMOVED *** 
                     *** FORMATTING TO BE REMOVED *** 
               </div> 
                bla bla bla bla
           </div>";

$content =~ s/<div id='print'>(.*?)<\/div>//gis;

print $content;


Re: Removing HTML beginning and ending tag with everything in between? by ikegami (Patriarch) on Sep 01, 2011 at 20:23 UTC
For that very specific text, your substitution does work. (There's a syntax error in building your string, though: you use `"` as the delimiter and didn't escape the `"` characters inside the string.) Using XML::LibXML (which has an HTML parser), it would be: `for my $node ($root->findnodes('//div[@id="print"]')) { $node->parentNode()->removeChild($node); }`
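
The loop above assumes a `$root` that has already been parsed. A minimal, self-contained sketch of how that might look is below, assuming XML::LibXML is installed; the shortened sample markup, the `recover => 2` setting, and the variable names `$doc` and `$result` are illustrative choices, not part of the original reply.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup modelled on the question; the heredoc avoids quote escaping.
my $content = <<'HTML';
<div id="content">
    BLa Bla Bla
    <div id='print'>
        *** TEXT TO BE REMOVED ***
    </div>
    bla bla bla bla
</div>
HTML

# Assumption: recover => 2 asks the parser to continue past malformed
# markup without printing warnings.
my $doc = XML::LibXML->load_html(string => $content, recover => 2);

# Detach every <div id="print"> node, along with all of its children.
for my $node ($doc->findnodes('//div[@id="print"]')) {
    $node->parentNode->removeChild($node);
}

my $result = $doc->toStringHTML;
print $result;
```

Because the parser works on the element tree, the attribute quoting style does not matter here, and the whole element is removed even if it contains nested `div` elements. Note that the HTML parser may wrap a fragment in `html` and `body` elements when serializing.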
Re: Removing HTML beginning and ending tag with everything in between? by Kc12349 (Monk) on Sep 01, 2011 at 20:52 UTC
Take a look at your single versus double quotes: you have double quotes around "content" and single quotes around 'print'. Other than that, my output for your regex code is below. Is this not what you are looking for? `<div id="content"> BLa Bla Bla bla bla bla bla </div>`
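
For reference, here is one way the quoting problem from the question could be sidestepped so that the substitution runs as-is. This is only a sketch: the `q{...}` string and the `s{...}{}` delimiters are swapped in to avoid escaping, and the unused capture group is dropped.

```perl
use strict;
use warnings;

# q{...} is a single-quoted string, so the embedded " and ' characters
# need no escaping.
my $content = q{<div id="content">
               BLa Bla Bla
               <div id='print'>
                     *** TEXT TO BE REMOVED ***
                     *** CODE TO BE REMOVED ***
                     *** FORMATTING TO BE REMOVED ***
               </div>
                bla bla bla bla
           </div>};

# /s lets . match newlines, /i ignores case, /g repeats for every match.
# Using s{...}{} means the / in </div> does not have to be escaped.
$content =~ s{<div id='print'>.*?</div>}{}gis;

print $content;
```

Keep in mind that the non-greedy `.*?` stops at the first `</div>` it finds, so this only works while the `print` div contains no nested `div`, and the pattern only matches that exact single-quoted attribute form.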
