in reply to HTML stripper...
This regex will remove valid HTML comments:
s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx
Possibly a typo in mocked-up test HTML, but this is not a valid HTML comment:
<!-- testing test-->
By my reading, it's invalid because there's no whitespace before -->.
Here's the section of the W3C HTML Recommendation dealing with the syntax of HTML comments.
You've also posted comments that seem to indicate that you want JavaScript removed, but your sample output doesn't bear that out. Please clarify this point.
Update:
There appears to be some disagreement over what constitutes a valid HTML comment.
I used the following code to test my solution:
#!perl

use 5.12.0;
use warnings;

{
    local $/ = undef;    # slurp the whole file in one read
    open my $fh, '<', $ARGV[0] or die $!;
    (my $html = <$fh>) =~ s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx;
    close $fh;
    say $html;
}
This produced the OP's "Desired Output" with the exception of
<!-- testing test-->
remaining in the output.
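To see that behaviour without the OP's file, the same substitution can be run against a couple of inline strings. This is only a minimal sketch: the sample HTML and the `strip_comments` helper name are made up for illustration and are not from the original thread.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution shown above to a string.
sub strip_comments {
    my ($html) = @_;
    $html =~ s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx;
    return $html;
}

# Made-up samples: one comment with whitespace before -->, one without.
my $spaced   = "<p>a</p><!-- this is a comment --><p>b</p>";
my $unspaced = "<p>a</p><!-- testing test--><p>b</p>";

print strip_comments($spaced),   "\n";    # comment is removed
print strip_comments($unspaced), "\n";    # comment is left in place
```

The second comment survives because the group never consumes the character immediately before -->, and the following \s cannot match the final "t".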
I then checked the W3C reference document (linked above) which states:
HTML comments have the following syntax:
<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->
Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.
If anyone has more definitive information (e.g. Backus-Naur Form notation), a link to that would be useful and welcome.
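For the OP: to also remove that remaining comment, regardless of whether it's valid or not, changing the trailing \s to \s* is not enough, because the group never consumes the character immediately before -->. Put the lookahead in front of the dot and drop the \s requirements instead:
s{<!-- (?: (?!-- \s* >) . )* -- \s* >}{}gmsx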
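One way to check a relaxed pattern is to run it over comments with and without whitespace before the closing -->. The sketch below uses a form that puts the negative lookahead in front of the dot, so the character immediately before --> can still be consumed. The helper name `strip_all_comments` and the sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: removes comments whether or not whitespace precedes -->.
sub strip_all_comments {
    my ($html) = @_;
    # At each position, first check that the closing delimiter does not
    # start here, then consume one character (including newlines, via /s).
    $html =~ s{<!-- (?: (?! -- \s* > ) . )* -- \s* >}{}gsx;
    return $html;
}

# Made-up samples: spaced, unspaced, and multi-line with a space before >.
for my $sample (
    "a<!-- this is a comment -->b",
    "a<!-- testing test-->b",
    "a<!-- spans\n two lines -- >b",
) {
    print strip_all_comments($sample), "\n";
}
```

For anything beyond quick clean-up of known input, a real HTML parser is likely to be more robust than a regex.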
-- Ken