in reply to Removing html comments with regex
Okay, I won't make you use modules, but why not use them? If you decide to do things the right way, you can use this snippet from HTML::TokeParser::Simple:
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $new_folder = 'no_comment/';    # this directory must already exist
    my @html_docs  = glob( "*.html" );

    foreach my $doc ( @html_docs ) {
        print "Processing $doc\n";
        my $new_file = "$new_folder$doc";
        open my $out, '>', $new_file
            or die "Cannot open $new_file for writing: $!";
        my $p = HTML::TokeParser::Simple->new( $doc );
        while ( my $token = $p->get_token ) {
            next if $token->is_comment;    # drop comments, copy everything else
            print $out $token->as_is;
        }
        close $out;
    }
Why might you get the regex wrong? Aside from the fact that none of your regexes allow for newlines, here's a bit of information from the HTML::Parser documentation:
By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Netscape and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--".
This probably isn't an issue for you, but it is something to think about. If you're forced to deal with it at some point in the future, you'll find your regex isn't enough. Most people who hand-roll their own code discover "it isn't enough" when their only reason for doing so is a desire to avoid modules. If, on the other hand, you wanted to avoid modules because you understood your problem space and no appropriate module satisfied it, you would have a better chance of getting this correct.
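To make the newline point concrete, here is a minimal sketch using only core Perl. The sample string and the variable names (`$html`, `$no_s`, `$with_s`) are made up for illustration and are not taken from the original question. Note that even the second substitution only follows the "first `-->` ends the comment" convention. It does nothing about the stricter rule quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample: a comment that spans two lines.
my $html = "<p>keep</p><!-- a\nmulti-line comment --><p>also keep</p>";

# Without /s, "." does not match a newline, so the comment is left in place.
( my $no_s = $html ) =~ s/<!--.*?-->//g;

# With /s, "." also matches newlines, so the whole comment is removed.
( my $with_s = $html ) =~ s/<!--.*?-->//gs;

print "without /s: $no_s\n";
print "with /s:    $with_s\n";
```

The `/s` modifier fixes this one case, but it is still a guess about what counts as a comment, which is exactly the work the parser does for you.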
That being said, comments are generally easier to remove than other HTML snippets, but I still would not use a regular expression.
Cheers,
Ovid