HTML stripper...

stripping characters from html

jonnyfolk

getting and printing form values etc from html stripping out all else

kalkisong

Strip HTML, while preserving layout, with core(-ish) modules

Your Mother

RFC: HTML::StripScripts::LibXML

clinton

Med

clinton

Returning an XML::LibXML::DocumentFragment from HTML::StripScripts

strip html tags and special characters in perl while inserting the text in to database.

valavanp

HTML::Strip and UTF8 -- is there some way I can just skip all the "UTF8 only" entities?

tphyahoo

HTML::Strip question--stripping only certain tags?

Anonymous Monk

HTML::TokeParser not stripping entities and xhtml

bradcathey

HTML stripper in WWW::Mechanize doesn't seem to work

lampros21_7

Answer: How can use Perl to strip away some nested HTML markup code, like <SCRIPT> ?

Stripping HTML tags efficiently

agynr

Answer: How can use Perl to strip away some nested HTML markup code, like <SCRIPT> ?

Anonymous Monk

HTML image tag stripping

Strip HTML line breaks from list of URLs

Anonymous Monk

Stripping a-href tags from an HTML document

keyDemun

Stripping of HTML content

Nemp

epoptai

Temporarily strip HTML

Stripping HTML tags with Regular Expressions.

dda

Strip HTML tags again

Answer: how-to strip empty HTML tags like b /b

Answer: how-to strip empty HTML tags like b /b

Answer: how-to strip empty HTML tags like b /b

how-to strip empty HTML tags like <b> </b>

CatQ

f0dder

How to strip HTML using latest module

HTML::FormatText stripping last word

aijin

Strip Brain-Damaged Mails of "HTML Alternative" Evilness

Answer: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?

Answer: How can I strip away some nested markup code in html by perl, like <SCRIPT> ?

How can use Perl to strip away some nested HTML markup code, like <SCRIPT> ?

CatQ

by roboticus (Chancellor) on Nov 22, 2010 at 11:50 UTC

A dense list of URLs like that and no RickRoll? I'm shocked, SHOCKED!

...roboticus

by Anonymous Monk on Nov 22, 2010 at 12:04 UTC

Awooga. Awooga. Awooga! Censure. Censure. Censure. You shouted.

by Anonymous Monk on Nov 22, 2010 at 07:26 UTC

Hi all, I think you have misunderstood my question. I don't want to strip any HTML tags. I want to remove only the comments in HTML and JavaScript at the same time... Thanks...

by Anonymous Monk on Nov 22, 2010 at 07:47 UTC

I want to remove only the comments in HTML and JavaScript at the same time... Thanks...

So use one of the previously linked solutions, but configure it to remove only comments and JavaScript.

Re: HTML stripper...
by JavaFan (Canon) on Nov 22, 2010 at 11:23 UTC

It should remove the comments in each code.

and nothing else

by locked_user sundialsvc4 (Abbot) on Nov 23, 2010 at 21:36 UTC

Indeed. You are probably going to go straight to something like HTML::Parser ... a true parser that can invoke action-routines when a particular construct has been recognized, no matter what it actually takes to recognize the presence of a construct.
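As a rough illustration of the parser-based approach, here is a minimal sketch that assumes HTML::Parser is installed from CPAN. The sub name `strip_comments` is made up for this example. It copies every parser event through unchanged except comment events, which are dropped. It does not touch `//` or `/* */` comments inside script content; those would need separate handling.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, assumed to be installed

# Hypothetical helper: return $html with HTML comments removed.
sub strip_comments {
    my ($html) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Any event without a handler of its own: append the original source text.
        default_h => [ sub { $out .= shift }, 'text' ],
        # Comment events: an empty handler means "ignore", so they are dropped.
        comment_h => [ '' ],
    );
    $parser->parse($html);
    $parser->eof;       # flush anything still buffered
    return $out;
}

print strip_comments("<p>keep</p><!-- drop me --><p>keep too</p>"), "\n";
```

Because the parser decides what counts as a comment, the whitespace questions that come up with hand-written regexes do not arise here.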

Re: HTML stripper...
by kcott (Archbishop) on Nov 22, 2010 at 08:22 UTC

This regex will remove valid HTML comments:

s{<!-- \s ( . (?!-- \s* >) )* \s -- \s* >}{}gmsx

Possibly a typo in mocked-up test HTML, but this is not a valid HTML comment:

<!-- testing 
test-->

It's invalid because there's no space before -->.

Here's the section of the W3C HTML Recommendation dealing with the syntax of HTML comments.

You've also posted comments that seem to indicate that you want Javascript removed but your sample output doesn't bear that out. Please clarify this point.

Update:

There appears to be some disagreement over what constitutes a valid HTML comment.

I used the following code to test my solution:

#!perl

use 5.12.0;
use warnings;

{
    local $/ = undef;

    open my $fh, '<', $ARGV[0] or die $!;

    (my $html = <$fh>) =~ s{<!-- \s ( . (?!-- \s* >) )*  \s -- \s* >}{}gmsx;

    close $fh;

    say $html;
}

This produced the OP's "Desired Output" with the exception of

<!-- testing
test-->

remaining in the output.

I then checked the W3C reference document (linked above) which states:

HTML comments have the following syntax:

<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.

If anyone has more definitive information (e.g. Backus-Naur Form notation), a link to that would be useful and welcome.

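For the OP: to also remove that remaining comment, regardless of whether it's valid or not, drop the \s before the closing -- and move the lookahead in front of the dot. (Changing \s to \s* alone is not enough: with the lookahead after the dot, the group can never consume the character immediately before -->, so that character still has to be whitespace.)

s{<!-- \s (?: (?! -- \s* >) . )* -- \s* >}{}gmsx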
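Where the negative lookahead sits changes what patterns of this shape can match. The sketch below is a small harness for checking that against the two comment styles discussed in this thread. The names `%pattern`, `strip_with` and `@sample` are invented for the illustration, and the three patterns are variations on the one above, not a definitive solution.

```perl
use strict;
use warnings;

# Hypothetical names; each entry is a variation on the pattern in the post.
my %pattern = (
    # lookahead after the dot, whitespace required before the closing "--"
    after_dot_strict => qr{<!-- \s (?: . (?!-- \s* >) )* \s  -- \s* >}sx,
    # lookahead after the dot, whitespace before the closing "--" optional
    after_dot_loose  => qr{<!-- \s (?: . (?!-- \s* >) )* \s* -- \s* >}sx,
    # lookahead before the dot, no whitespace required before "--"
    before_dot       => qr{<!-- \s (?: (?! -- \s* >) . )* -- \s* >}sx,
);

# Return a copy of $html with every match of the named pattern removed.
sub strip_with {
    my ($name, $html) = @_;
    my $re = $pattern{$name};
    (my $copy = $html) =~ s/$re//g;
    return $copy;
}

my @sample = (
    "<p>a</p><!-- this is a comment --><p>b</p>",
    "<p>a</p><!-- testing\ntest--><p>b</p>",
);

for my $name (sort keys %pattern) {
    for my $i (0 .. $#sample) {
        my $status = strip_with($name, $sample[$i]) =~ /<!--/
            ? 'comment left'
            : 'comment removed';
        printf "%-16s sample %d: %s\n", $name, $i + 1, $status;
    }
}
```

All three variants remove the first sample's comment. Only `before_dot` removes the second, because the other two stop the group one character short of `-->` and then need whitespace (or nothing at all) in that position.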

-- Ken


by JavaFan (Canon) on Nov 22, 2010 at 11:18 UTC

It's invalid because there's no space before -->

COM

--

OTOH, your pattern falsely considers  to be a valid comment, while it doesn't consider  --> to be valid.

This matches HTML comments:

  <!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>
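To see what that pattern does in practice, here is a short sketch that wraps it in a substitution. The names `$comment_re` and `strip_sgml_comments` are invented for this example; the regex itself is copied unchanged from above.

```perl
use strict;
use warnings;

# The pattern from the post: "<!", zero or more "--...--" sections, then ">".
my $comment_re = qr/<!(?:--(?:[^-]*(?:-[^-]+)*)--\s*)*>/;

# Hypothetical helper: return $html with every match of the pattern removed.
sub strip_sgml_comments {
    my ($html) = @_;
    (my $copy = $html) =~ s/$comment_re//g;
    return $copy;
}

my @sample = (
    "a<!-- testing\ntest-->b",    # no whitespace before the closing "--"
    "a<!-- x --  >b",             # whitespace between "--" and ">"
    "a<! -- x -->b",              # whitespace between "<!" and "--"
);

for my $html (@sample) {
    my $out = strip_sgml_comments($html);
    print $out eq $html ? "left alone: " : "stripped:   ", $out, "\n";
}
```

The first two samples are stripped down to `ab`; the third is left alone because the pattern requires `--` directly after `<!`. Note that the pattern also matches the empty declaration `<!>`, since the group of `--...--` sections may repeat zero times.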

\s


by kcott (Archbishop) on Nov 22, 2010 at 23:57 UTC

Firstly, I've added an update to my post, please read that.

Secondly, rather than just stating "That's bogus ...", perhaps you could cite a reference.

-- Ken

Re^4: HTML stripper...

by JavaFan (Canon) on Nov 23, 2010 at 00:36 UTC

by JavaFan (Canon) on Nov 23, 2010 at 00:42 UTC

Note the whitespace between comment and --> in both cases. Also note that the documentation makes no further reference to whitespace in that position.

by Argel (Prior) on Nov 22, 2010 at 21:06 UTC

Legal:   "<!--" 
Illegal: "<! --"
  -and-
Legal:  "-->"
Legal:  "--  >"

Elda Taluta; Sarks Sark; Ark Arks


by kcott (Archbishop) on Nov 22, 2010 at 23:39 UTC

"Did you even read what you linked to?"

That's fairly unpleasant, bordering on rudeness.

Note the <!-- (at the start of the regex) and the -- \s* > (at the end of the regex) which deals with those rules.

Also take a look at my updated post which indicates more of what I read.

-- Ken


Re: HTML stripper...
by Anonymous Monk on Nov 22, 2010 at 05:10 UTC

I think I would call that a "comment stripper", since it strips off comments, not markup.