n4mation has asked for the wisdom of the Perl Monks concerning the following question:

I have researched the regex for removing html comment tags:
<!-- comments -->
The one I have is supposed to work:
$comments =~ s/<!--.*?-->//g;
I tried this one to take care of breaks:
$comments =~ s/<!--(.|\s)*?-->//g;
It ain't werkin'
Any idees?
Thanky!
Please don't make me use the modules!

Re: Removing html comments with regex
by Ovid (Cardinal) on Aug 23, 2003 at 05:09 UTC

    Okay, I won't make you use modules, but why not use them? If you decide to do things the right way, you can use this snippet from HTML::TokeParser::Simple:

    use strict;
    use HTML::TokeParser::Simple;

    my $new_folder = 'no_comment/';
    my @html_docs  = glob( "*.html" );

    foreach my $doc ( @html_docs ) {
        print "Processing $doc\n";
        my $new_file = "$new_folder$doc";
        open PHB, "> $new_file" or die "Cannot open $new_file for writing: $!";
        my $p = HTML::TokeParser::Simple->new( $doc );
        while ( my $token = $p->get_token ) {
            next if $token->is_comment;
            print PHB $token->as_is;
        }
        close PHB;
    }

    Why might you get the regex wrong? Aside from the fact that your first regex does not allow for newlines, here's a bit of information from the HTML::Parser documentation:

    By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Netscape and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--".

    This probably isn't an issue for you, but it is something to think about. If you're forced to deal with this at some point in the future, you'll find your regex isn't enough. Most who try to handroll their own code find "it isn't enough" when their reason for doing that is simply a desire to not use modules. Now, if you wanted to avoid using the modules because you understood your problem space and there were no appropriate modules that satisfied this, then you would have a better chance of getting this correct.

    That being said, comments are generally easier to remove than other HTML snippets, but I still would not use a regular expression.

    Cheers,
    Ovid

    New address of my CGI Course.

      Please forgive my newbie novice questions, but how is all of that easier and faster than a one liner? Isn't regex supposed to be powerful, and fast? I still am really curious why the regex won't work. I wuz just thinking: Maybe the modules aren't haunted, but something's making the comments disappear.

        It's easier because it's already written, it works, and you don't have to know how it works to use it.

        The problem with regular expressions and HTML is that HTML isn't required to be regular. It's not hard to write a simple regex that processes simple HTML, but it's easy to write realistic HTML that a regex won't process.
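        As a small illustration of that point, here is a sketch with made-up input (the `banner` variable and the `strip_naive` helper are hypothetical, not from the thread). The comment-like text sits inside a JavaScript string, so a parser that understands script elements would treat it as script text, but the simple regex removes it anyway:

```perl
use strict;
use warnings;

# Naive approach from the thread: delete everything between <!-- and -->.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;
    return $html;
}

# Hypothetical input: "<!-- sale -->" is part of a string literal, not a comment.
my $html = qq{<script>var banner = "<!-- sale -->Today only";</script>\n};

# The regex cannot tell markup from script text, so the string literal is altered.
print strip_naive($html);
```

        The script's string now reads "Today only", which changes what the page does rather than just removing a comment.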

        In addition to chromatic's remarks, it's also a good idea to learn how to use the HTML parser modules because you will almost certainly run into an application for them later. Using them to delete comments is one thing, but if you ever want to do anything more complex with HTML, you'll already be familiar with this tool. I recommend Ovid's very intuitive HTML::TokeParser::Simple. (I was in your situation not too long ago.) :)

        --
        Allolex

Re: Removing html comments with regex
by blokhead (Monsignor) on Aug 23, 2003 at 03:28 UTC
    Use the /s option to s/// to allow the period to match newlines as well. This works for my multiline comment test:
    my $stuff = do { local $/; <DATA> };
    $stuff =~ s/<!--.*?--\s*>//gs;
    print "$stuff\n";
    __DATA__
    asdfasdkf <!-- asdfaskdf
    --> Hello <!--
    World -->
    Please don't make me use the modules!
    You make it sound like they're haunted or something...

    Update: Hey, your second regex worked for this data. Are you sure it doesn't work for you?
    Update2: The spec allows for whitespace between the closing -- and >: updated the regex accordingly.
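    To make the effect of /s concrete, here is a minimal sketch comparing the original one-liner with and without the modifier. The helper names and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Without /s, "." does not match a newline, so a comment that spans
# lines is never matched and the text comes back unchanged.
sub strip_without_s {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//g;
    return $text;
}

# With /s, "." also matches newlines, so multi-line comments are removed.
sub strip_with_s {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;
    return $text;
}

my $html = "before <!-- line one\nline two --> after\n";

print "without /s: ", strip_without_s($html);
print "with /s:    ", strip_with_s($html);
```

    The first call prints the input untouched; the second prints the text with the two-line comment removed.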

    blokhead

Re: Removing html comments with regex
by Abigail-II (Bishop) on Aug 23, 2003 at 22:39 UTC
    $comments =~ s/<!--.*?-->//g;
    $comments =~ s/<!--(.|\s)*?-->//g;

    You, and several of the people replying, seem to have a wrong impression of what HTML comments are. HTML comments are not pieces of text delimited by <!-- and -->. It's more complicated. Zero or more comments may appear inside a comment declaration. The declaration is delimited by <! and >. Comments are delimited by --. There is optional white space after each comment.

    The following quote is from RFC 1866, the only RFC to ever define an HTML standard:

    To include comments in an HTML document, use a comment declaration. A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored.

    NOTE - Some historical HTML implementations incorrectly consider any `>' character to be the termination of a comment.

    For example:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HEAD>
    <TITLE>HTML Comment Example</TITLE>
    <!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp -->
    <!-- another -- -- comment -->
    <!>
    </HEAD>
    <BODY>
    <p> <!- not a comment, just regular old data characters ->

    The SGML Handbook [1] has to say the following about comments:

    10.3 Comment Declaration

    A comment is a string of SGML characters that is used in the parameter separators of markup declarations to explain what the declaration is doing.

    It is also possible to have a declaration that contains nothing but comments; this is called a comment declaration. Comment declarations are frequently used in markup declaration subsets to separate the declarations from one another and to explain what each group of declarations is for. They can also be used by authors in the document instance to provide instructions for other people working with the document, or reminders for themselves.

    [91] comment declaration =
             mdo,                             [ <! ]
             ( comment, ( s | comment )* )?,
             mdc                              [ > ]

    [92] comment =
             com,                             [ -- ]
             SGML character*,
             com                              [ -- ]

    No markup is recognized in a comment, other than the com delimiter that terminates it.

    The s rule is similar to \s* in Perl regular expressions. The rule SGML character lists valid characters.

    This means that a regexp to recognize HTML comments is something like:

    /<!(?:--(?:[^-]*|-[^-]+)*--\s*)*>/

    Or you could simply do:

    use Regexp::Common;
    /$RE{comment}{HTML}/
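    As a rough check of the hand-written regex against the RFC 1866 example, here is a sketch that uses the pattern above with a `*` on the outer group, so that zero or more comments are accepted as the quoted definition requires. The variable and helper names are made up for illustration:

```perl
use strict;
use warnings;

# Comment declaration: "<!", then zero or more "--...--" comments,
# each optionally followed by white space, then ">".
my $comment_decl = qr/<!(?:--(?:[^-]*|-[^-]+)*--\s*)*>/;

sub strip_comment_declarations {
    my ($html) = @_;
    $html =~ s/$comment_decl//g;
    return $html;
}

# Lines taken from the RFC 1866 example quoted above.
my @samples = (
    '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">',
    '<!-- another -- -- comment -->',
    '<!>',
    '<!- not a comment, just regular old data characters ->',
);

for my $line (@samples) {
    my $left = strip_comment_declarations($line);
    printf "%-8s %s\n", ( $left eq $line ? 'kept' : 'removed' ), $line;
}
```

    The DOCTYPE and the single-hyphen line should be reported as kept, while the two-comment declaration and the empty declaration `<!>` should be reported as removed.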

    [1] Charles F. Goldfarb: The SGML Handbook. Oxford, Clarendon Press, 1990.

    Abigail

Re: Removing html comments with regex
by Theo (Priest) on Aug 23, 2003 at 21:40 UTC
    In addition to the above cautions, a simple regex may also delete some Server Side Include commands since they look a lot like comments.

    <!--#include virtual="/ssi/navbarB.ssi" -->
    <!--#echo var="SERVER_NAME" -->
    <!--#echo var="Document_URI" -->
    <!--#config timefmt="%Y/%m/%d at %H:%M:%S %Z" -->
    <!--#echo var="LAST_MODIFIED" -->

    -ted-

      If those appear in an HTML document, they not only look like comments, they are HTML comments.

      Abigail

        If SSI commands appear in the html sent to the browser, then something went wrong at the server. So you're right from the browser's point of view - they are comments. But if he is working with source html, on the server, he could run into that situation.
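        For the source-html case, one possible workaround is a negative lookahead that skips anything starting with `<!--#`. This is only a sketch: it uses the simple `<!-- ... -->` notion of a comment rather than the full comment-declaration syntax, and the helper name is made up for illustration.

```perl
use strict;
use warnings;

sub strip_comments_keep_ssi {
    my ($html) = @_;
    # (?!#) refuses to start a match at "<!--#", the SSI directive prefix,
    # so directives survive while ordinary comments are removed.
    $html =~ s/<!--(?!#).*?-->//gs;
    return $html;
}

my $source = <<'END_HTML';
<!--#echo var="SERVER_NAME" -->
<!-- reminder: update the nav bar -->
<p>Hello</p>
END_HTML

print strip_comments_keep_ssi($source);
```

        The `#echo` directive is left in place and only the reminder comment is removed.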

        -ted-

Re: Removing html comments with regex
by Anonymous Monk on Aug 23, 2003 at 04:19 UTC
    I'm trying to store a snippet along with other html in a file from a form textarea field. Everything else stores in there. The comment makes the stuff in the snippet disappear. I tried removing other regex to avoid any conflicts. If I remove one of the comments, the snippet shows up even without the regex. Otherwise, the snippet disappears.
    <!-- comments -->
    snippet of code
    <!-- comments -->

      You have provided a regular expression and some sample data. You say that the regular expression removes "snippet of code" from your sample data, but it does not.

      Here, I've adapted blokhead's framework and plugged in your regular expression and your data:

      #!/usr/bin/perl -w
      use strict;

      my $comments = do { local $/; <DATA> };
      print "BEFORE:\n$comments";
      $comments =~ s/<!--(.|\s)*?-->//g;
      print "AFTER:\n$comments";
      __DATA__
      <!-- comments -->
      snippet of code
      <!-- comments -->
      The output is:
      BEFORE:
      <!-- comments -->
      snippet of code
      <!-- comments -->
      AFTER:

      snippet of code
      If you want to have someone debug your problem, you'll have to provide the real regular expression you're using or the real data. (And, yes, I did test it out with multi-line comments, too.)

      BTW, $comments is a little strange as a name for text which includes more than just comments. Are you sure that it contains the snippet of code before you apply the regular expression?

      -- Eric Hammond

Re: Removing html comments with regex
by n4mation (Acolyte) on Aug 24, 2003 at 02:04 UTC
    Esh:
    I did not say the regex was removing the snippet of code. I said the comment tags were removing the snippet of code. If I have something like this:

    <!-- BEGIN N4MATIONS POOP -->
    Several lines of
    code in here
    <!-- END N4MATIONS POOP -->

    If I leave both the comment tags, it stores nothing in the file. I can remove either the begin tag, or the end tag manually, and the code will store in the file. I want to remove the comment tags with regex, and I guess I'm missing something. I don't speak fluent Perl, so I don't understand much yet. I am using another regex for the same textarea, and it works fine. I appreciate the help though.
Re: Removing html comments with regex
by davido (Cardinal) on Aug 25, 2003 at 03:34 UTC
    There are a lot of reasons why people have responded to your question with the answer that you should use a module to parse HTML. One of the most important is that HTML cannot simply be handled by a regular expression. Even in the simple case of removing just comments from HTML, there are more things to consider than a simple regexp can handle.

    You will find the possibility of nested comments, of server-side includes (which look like comments), of comments with multiple --comment-- blocks, and who knows what else, that could foul up your regexp plan.

    There is also good logical reason to use modules. You might spend five hours figuring out your regular expression, and it still won't work 100% of the time. A well written module has many thousands of collective man-hours of work, from not just the primary author but also the vast user base of that module within the Perl community.

    One person alone may or may not get something right. A large base of programmers puts a module through hoops that the primary author may not have even considered in the first place, and contributes suggestions, comments, and bug alerts. They expose flaws, they find nits to pick, and in the end, the module emerges robust, reliable, and secure. This is an evolving process; nobody can say that in a fast-moving infrastructure such as the Internet or computers in general that a module is ever finished. But it's lightyears down the development path past the regexp that you might cook up in an evening with a slice or two of pizza.

    The famous quote is applicable in this situation: "You can fool all of the people some of the time, and you can fool some of the people all of the time, but you cannot fool all of the people all of the time."

    A module has to stand up to the rigors of all of the people, all of the time. It has proven itself not just with some people sometimes, but all people (who use it) always. To get the kind of robustness that you can find in well-written and trusted modules, you would have to quit your day job for the next ten years.

    One reason Perl exists is because we're all basically lazy. Perl's proponents are terribly lazy, and Perl helps them to support the lazy lifestyle. (Ok, many are also hard workers, but lazily so). Modules support laziness too, which is a good thing. But refusing to learn to use modules, out of laziness, is false laziness, or misguided laziness, for that extra 10 minutes it takes to figure out how to use a module will save you countless hours down the road.

    Another reason for using the module to solve a problem is that it is already the answer to your question. The module is a form of FAQ. It is the answer to a frequent need, rather than a frequent question. When someone comes and says, how do I accomplish this, people say, oh, use the module. Many people have put time and effort into the module. Some of them are in this forum. If you say, "But I don't want to use the module" (despite the fact that it is designed to be the answer to your problem), you are saying, "Thanks for changing my tire for me, but could you do it again, this time lifting the car by hand instead of using the jack?"

    Don't resist what works best for the vast majority. Your case is not so different from that of everyone else.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein