This regex will remove valid HTML comments:

s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx

Possibly a typo in mocked-up test HTML, but this is not a valid HTML comment:

<!-- testing test-->

It's invalid because there's no space before -->.
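To make the behaviour of that pattern concrete, here is a minimal sketch that spreads the same substitution over several lines with /x comments and runs it on both forms of the comment. The helper name strip_strict_comments and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the pattern from this post to a string.
sub strip_strict_comments {
    my ($html) = @_;
    $html =~ s{
        <!-- \s              # opening delimiter plus one whitespace character
        ( . (?!-- \s* >) )*  # any character not directly followed by the closing delimiter
        \s -- \s* >          # one whitespace character, then the closing delimiter
    }{}gmsx;
    return $html;
}

# Comment with whitespace before the closing delimiter: removed.
print strip_strict_comments("a<!-- testing test -->b"), "\n";

# Comment without that whitespace: left in place.
print strip_strict_comments("a<!-- testing test-->b"), "\n";
```

The second call returns its input unchanged because the pattern insists on a whitespace character before the closing --.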

Here's the section of the W3C HTML Recommendation dealing with the syntax of HTML comments.

You've also posted comments that seem to indicate that you want JavaScript removed, but your sample output doesn't bear that out. Please clarify this point.

Update:

There appears to be some disagreement over what constitutes a valid HTML comment.

I used the following code to test my solution:

#!perl
use 5.12.0;
use warnings;

{
    local $/ = undef;
    open my $fh, '<', $ARGV[0] or die $!;
    (my $html = <$fh>) =~ s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx;
    close $fh;
    say $html;
}

This produced the OP's "Desired Output" with the exception of

<!-- testing test-->

remaining in the output.

I then checked the W3C reference document (linked above) which states:

HTML comments have the following syntax:

<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

Note the whitespace between the comment text and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.

If anyone has more definitive information (e.g. Backus-Naur Form notation), a link to that would be useful and welcome.

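For the OP: to also remove that remaining comment, regardless of whether it's valid or not, changing the second \s to \s* is not enough by itself: the lookahead follows the dot, so the character immediately before --> can never be consumed. Moving the lookahead in front of the dot, and dropping that \s, handles both forms:

s{<!-- \s (?: (?! -- \s* >) . )* -- \s* >}{}gmsx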
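A small sketch for checking a lenient pattern against both kinds of comment. The helper name strip_all_comments and the sample string are made up for illustration; the pattern here puts the negative lookahead in front of the dot, so the character directly before --> can still be consumed.

```perl
use strict;
use warnings;

# Hypothetical helper: removes comments whether or not whitespace
# precedes the closing delimiter.
sub strip_all_comments {
    my ($html) = @_;
    $html =~ s{
        <!-- \s                 # opening delimiter plus one whitespace character
        (?: (?! -- \s* >) . )*  # consume characters until the closing delimiter starts
        -- \s* >                # closing delimiter
    }{}gmsx;
    return $html;
}

# One comment without the trailing space, one with it.
print strip_all_comments("<p>keep</p><!-- testing test--><!-- valid -->"), "\n";
```

Checking the lookahead before consuming each character is what lets the match run right up to the closing delimiter, so no separate whitespace test is needed at the end.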

-- Ken


In reply to Re: HTML stripper... by kcott
in thread HTML stripper... by Anonymous Monk
