in reply to HTML stripper...

This regex will remove valid HTML comments:

s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx

Possibly a typo in mocked-up test HTML, but this is not a valid HTML comment:

<!-- testing test-->

It's invalid because there's no space before -->.

Here's the section of the W3C HTML Recommendation dealing with the syntax of HTML comments.

You've also posted comments that seem to indicate that you want JavaScript removed, but your sample output doesn't bear that out. Please clarify this point.

Update:

There appears to be some disagreement over what constitutes a valid HTML comment.

I used the following code to test my solution:

#!perl
use 5.12.0;
use warnings;

{
    local $/ = undef;
    open my $fh, '<', $ARGV[0] or die $!;
    (my $html = <$fh>) =~ s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx;
    close $fh;
    say $html;
}

This produced the OP's "Desired Output" with the exception of

<!-- testing test-->

remaining in the output.

I then checked the W3C reference document (linked above) which states:

HTML comments have the following syntax:

<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.

If anyone has more definitive information (e.g. Backus-Naur Form notation), a link to that would be useful and welcome.

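For the OP: to also remove that remaining comment, regardless of whether it's valid or not, change the \s before the closing -- to \s* and move the lookahead in front of the dot (otherwise the group can never consume the character directly before -->):

s{<!-- \s ( (?!-- \s* >) . )* \s* -- \s* >}{}gmsx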
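To see how a strict and a relaxed pattern treat the disputed comment, both can be run against small samples. This is only a sketch: the sample strings and the helper name strip_comments are made up for illustration, and $loose is written with the negative lookahead in front of the dot, since a group of the form ( . (?!-- \s* >) )* can never consume the character that sits directly before -->, so it cannot reach a closing -- that has no whitespace in front of it.

```perl
use strict;
use warnings;

# First pattern from the post: requires whitespace before the closing "--".
my $strict = qr{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}sx;

# Assumed relaxed variant: lookahead checked before each character,
# and whitespace before the closing "--" made optional.
my $loose = qr{<!-- \s (?: (?!-- \s* >) . )* \s* -- \s* >}sx;

# Hypothetical helper: remove every match of $re from a copy of $html.
sub strip_comments {
    my ($re, $html) = @_;
    $html =~ s/$re//g;
    return $html;
}

# Made-up sample inputs: one spaced comment, one without the space.
my @samples = (
    '<p>a</p><!-- this is a comment --><p>b</p>',
    '<p>a</p><!-- testing test--><p>b</p>',
);

for my $html (@samples) {
    print "input : $html\n";
    print "strict: ", strip_comments($strict, $html), "\n";
    print "loose : ", strip_comments($loose,  $html), "\n\n";
}
```

With these inputs the strict pattern should strip only the first comment, while the relaxed variant should strip both.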

-- Ken

Re^2: HTML stripper...
by JavaFan (Canon) on Nov 22, 2010 at 11:18 UTC
    It's invalid because there's no space before -->
    That's bogus. There's no need for space to be there. Nor does there have to be space as the first character following a COM sequence (COM being --).

    OTOH, your pattern falsely considers <!-- -- --> to be a valid comment, while it doesn't consider <!-- <!-- --> --> to be valid.

    This matches HTML comments:

    <!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>
    although if you are truly pedantic, you'd replace the \s with the set of characters the HTML DTD defines as white space characters.
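The claims above can be checked by anchoring both patterns and testing whole strings. A rough sketch, not part of the original replies: the variable names $parent and $reply, the \A and \z anchors, and the sample list are assumptions added for illustration, and \s stands in for the DTD's white space set.

```perl
use strict;
use warnings;

# First pattern from the parent post, anchored to test whole strings.
my $parent = qr{\A <!-- \s ( . (?!-- \s* >) )* \s -- \s* > \z}sx;

# Pattern from this reply, anchored the same way.
my $reply = qr{\A<!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>\z};

# Sample strings taken from the thread.
my @samples = (
    '<!-- this is a comment -->',
    '<!-- testing test-->',
    '<!-- -- -->',
    '<!-- <!-- --> -->',
);

# Print one line per sample showing which pattern accepts it.
for my $s (@samples) {
    printf "%-28s parent=%-3s reply=%s\n",
        $s,
        ($s =~ $parent ? 'yes' : 'no'),
        ($s =~ $reply  ? 'yes' : 'no');
}
```

Under these assumptions the two patterns should agree only on the first sample: the parent pattern accepts <!-- -- --> and rejects the other two, while the reply's pattern does the opposite.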

      Firstly, I've added an update to my post, please read that.

      Secondly, rather than just stating "That's bogus ...", perhaps you could cite a reference.

      -- Ken

        perhaps you could cite a reference.
        Rules 91 and 92 of ISO 8879 (SGML).

        Charles F. Goldfarb: The SGML Handbook. Oxford: Oxford University Press. 1990. ISBN 0-19-853737-9. Ch. 10.3, pp 390.

Re^2: HTML stripper...
by JavaFan (Canon) on Nov 23, 2010 at 00:42 UTC
    Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.
    Note also that all the examples you cite lack capital letters. And I'm pretty sure the documentation makes no further mention of capital letters in comments - with your logic, they're forbidden. In fact, following your logic, there are only two HTML comments: the examples from the documentation.
Re^2: HTML stripper...
by Argel (Prior) on Nov 22, 2010 at 21:06 UTC
    Did you even read what you linked to? There is nothing in there about requiring a space before the closing dashes. The only whitespace rules they mention are basically:
    Legal: "<!--" Illegal: "<! --" -and- Legal: "-->" Legal: "-- >"

    Elda Taluta; Sarks Sark; Ark Arks

      "Did you even read what you linked to?"

      That's fairly unpleasant, bordering on rudeness.

      Note the <!-- (at the start of the regex) and the -- \s* > (at the end of the regex), which deal with those rules.

      Also take a look at my updated post which indicates more of what I read.

      -- Ken