Okay, I won't make you use modules, but why not use them? If you decide to do things the right way, you can use this snippet from HTML::TokeParser::Simple:

        use strict;
        use HTML::TokeParser::Simple;

        my $new_folder = 'no_comment/';
        my @html_docs  = glob( "*.html" );

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $new_file = "$new_folder$doc";

            open PHB, "> $new_file" or die "Cannot open $new_file for writing: $!";

            my $p = HTML::TokeParser::Simple->new( $doc );
            while ( my $token = $p->get_token ) {
                next if $token->is_comment;
                print PHB $token->as_is;
            }
            close PHB;
        }

Why might you get the regex wrong? Aside from the fact that none of your regexes allow for newlines, here's a bit of information from the HTML::Parser documentation:

By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Netscape and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--".

This probably isn't an issue for you, but it is something to think about. If you're forced to deal with this at some point in the future, you'll find your regex isn't enough. Most people who hand-roll their own code discover that "it isn't enough" when their only reason for doing so is a desire to avoid modules. If, on the other hand, you were avoiding modules because you understood your problem space and no existing module fit it, you would have a better chance of getting this right.

That being said, comments are generally easier to remove than other HTML snippets, but I still would not use a regular expression.
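
To make the newline point concrete, here is a minimal sketch using only core Perl. The sample string and the variable names are made up for illustration, and the pattern is just a typical hand-rolled attempt, not necessarily the one from your post. It shows that without the /s modifier, '.' does not match a newline, so a comment that spans lines is left in place.

```perl
use strict;
use warnings;

# Hypothetical sample input: one comment on a single line, one spanning two lines.
my $html = "<p>one</p><!-- short --><p>two</p><!-- spans\ntwo lines --><p>three</p>";

# Without /s, '.' will not match "\n", so the multi-line comment survives.
( my $naive = $html ) =~ s/<!--.*?-->//g;

# With /s, '.' matches "\n" too, so both comments are stripped.
( my $dotall = $html ) =~ s/<!--.*?-->//gs;

print "naive:  $naive\n";
print "dotall: $dotall\n";
```

Even the /s version only follows the "first -->" convention. It knows nothing about the stricter rule quoted above, which is one more reason I would reach for the parser.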

Cheers,
Ovid

New address of my CGI Course.

In reply to Re: Removing html comments with regex by Ovid
in thread Removing html comments with regex by n4mation
