in reply to Removing html comments with regex

Okay, I won't make you use modules, but why not use them? If you decide to do things the right way, you can use this snippet from HTML::TokeParser::Simple:

    use strict;
    use HTML::TokeParser::Simple;

    my $new_folder = 'no_comment/';
    my @html_docs  = glob( "*.html" );

    foreach my $doc ( @html_docs ) {
        print "Processing $doc\n";
        my $new_file = "$new_folder$doc";

        open PHB, "> $new_file" or die "Cannot open $new_file for writing: $!";
        my $p = HTML::TokeParser::Simple->new( $doc );
        while ( my $token = $p->get_token ) {
            next if $token->is_comment;    # drop comment tokens
            print PHB $token->as_is;       # copy everything else unchanged
        }
        close PHB;
    }

Why might you get the regex wrong? Aside from the fact that none of your regexes allow for newlines, here's a bit of information from the HTML::Parser documentation:

By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Netscape and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--".
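To make the difference between those two rules concrete, here is a rough sketch in plain Perl. The input string and both patterns are my own illustrations, not code taken from HTML::Parser, and the "strict" pattern is only an approximation of the official rule (a "<!", then any number of "--...--" comments separated by whitespace, then ">").

```perl
use strict;
use warnings;

# Hypothetical input. Read strictly, this is ONE comment declaration
# containing two comments ("" and "> hello "). Read the way browsers
# do, the comment ends at the first "-->".
my $html = 'before<!------> hello -->after';

# Browser-style rule: everything up to the first "-->".
my $lenient = qr/<!--.*?-->/s;

# Approximation of the strict rule: "<!", then zero or more "--...--"
# comments with only whitespace between them, then ">".
my $strict = qr/<!(?:--(?:(?!--).)*--\s*)*>/s;

(my $lenient_out = $html) =~ s/$lenient//g;
(my $strict_out  = $html) =~ s/$strict//g;

print "lenient: $lenient_out\n";    # lenient: before hello -->after
print "strict:  $strict_out\n";     # strict:  beforeafter
```

The same markup loses a different amount of text depending on which rule you pick, so a one-liner has to commit to one reading and will be wrong for documents written to the other.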

This probably isn't an issue for you, but it is something to think about. If you're forced to deal with it at some point in the future, you'll find your regex isn't enough. Most people who hand-roll their own code discover that "it isn't enough" when their only reason for doing so is a desire not to use modules. If, on the other hand, you were avoiding modules because you understood your problem space and no existing module fit it, you would have a better chance of getting this right.

That being said, comments are generally easier to remove than other HTML snippets, but I still would not use a regular expression.

Cheers,
Ovid

New address of my CGI Course.

Re: Re: Removing html comments with regex
by n4mation (Acolyte) on Aug 23, 2003 at 05:28 UTC
    Please forgive my newbie questions, but how is all of that easier and faster than a one-liner? Aren't regexes supposed to be powerful and fast? I'm still really curious why the regex won't work. I was just thinking: maybe the modules aren't haunted, but something's making the comments disappear.

      It's easier because it's already written, it works, and you don't have to know how it works to use it.

      The problem with regular expressions and HTML is that HTML isn't required to be regular. It's not hard to write a simple regex that processes simple HTML, but it's just as easy to write realistic HTML that the regex won't process.
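As a small illustration of that point, here is a sketch using only core Perl. The sub name strip_naive and its regex are made up for this example; they are not the regexes from the original question. The first input is "simple HTML" and works, while the other three are the kind of input that turns up in real pages.

```perl
use strict;
use warnings;

# Hypothetical one-liner-style stripper: greedy .* and no /s modifier.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<!--.*-->//g;
    return $html;
}

# 1. Simple HTML: the comment is removed as intended.
my $simple = strip_naive('<p>a</p><!-- note --><p>b</p>');

# 2. A comment spanning two lines: "." does not match a newline
#    without /s, so the comment is left in place.
my $multi_in = "<p>a</p><!-- line one\nline two --><p>b</p>";
my $multi    = strip_naive($multi_in);

# 3. Two comments on one line: the greedy .* runs from the first
#    "<!--" to the last "-->" and takes the paragraph with it.
my $greedy = strip_naive('<!-- x --><p>keep me</p><!-- y -->');

# 4. Comment-like text inside a script string: a parser would normally
#    hand this back as script text, but the regex deletes it.
my $script = strip_naive('<script>var s = "<!-- not a comment -->";</script>');

print "simple: $simple\n";
print "multi-line comment ", ( $multi eq $multi_in ? "survived" : "removed" ), "\n";
print "greedy: [$greedy]\n";
print "script: $script\n";
```

Adding /s and switching to a non-greedy .*? would fix cases 2 and 3, but case 4 needs knowledge of the surrounding markup, which is what a parser provides.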

        And the good news is, I'm just about finished with a brand new version of Regexp::Token. I need to write the docs and start creating weird edge cases. When done, it should allow you to safely remove comments.

        my $html_comment = Regexp::Token->create($some_comment_token);
        $html =~ s/$html_comment//g;

        Or do things like:

            $html =~ /some text($p_tag)more text/;

        I'm leaving for the beach tomorrow morning, but by Monday I hope to have this posted. Too bad it's ridiculously slow.

        Cheers,
        Ovid


      In addition to chromatic's remarks, it's also a good idea to learn how to use the HTML parser modules because you will almost certainly run into an application for them later. Using them to delete comments is one thing, but if you ever want to do anything more complex with HTML, you'll already be familiar with this tool. I recommend Ovid's very intuitive HTML::TokeParser::Simple. (I was in your situation not too long ago.) :)

      --
      Allolex