n4mation has asked for the wisdom of the Perl Monks concerning the following question:
|
Re: Removing html comments with regex
by Ovid (Cardinal) on Aug 23, 2003 at 05:09 UTC | |
Okay, I won't make you use modules, but why not use them? If you decide to do things the right way, you can use this snippet from HTML::TokeParser::Simple:
Why might you get the regex wrong? Aside from the fact that none of your regexes allow for newlines, here's a bit of information from the HTML::Parser documentation:
By default, comments are terminated by the first occurrence of "-->". This is the behaviour of most popular browsers (like Netscape and MSIE), but it is not correct according to the official HTML standard. Officially, you need an even number of "--" tokens before the closing ">" is recognized and there may not be anything but whitespace between an even and an odd "--".
This probably isn't an issue for you, but it is something to think about. If you're forced to deal with this at some point in the future, you'll find your regex isn't enough. Most who try to handroll their own code find "it isn't enough" when their reason for doing that is simply a desire to not use modules. Now, if you wanted to avoid using the modules because you understood your problem space and there were no appropriate modules that satisfied this, then you would have a better chance of getting this correct.
That being said, comments are generally easier to remove than other HTML snippets, but I still would not use a regular expression.
Cheers,
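To make the newline point concrete, here is a minimal sketch in core Perl. The sample markup and variable names are made up for illustration, and the substitution is only the browser-style "stop at the first -->" rule, not the strict rule from the standard.

```perl
use strict;
use warnings;

# Made-up sample: one comment on a single line, one spanning two lines.
my $html = "<p>one</p><!-- short -->\n<p>two</p><!-- spans\ntwo lines -->\n<p>three</p>\n";

# Without /s, "." does not match a newline, so the two-line comment
# is left in place.
(my $without_s = $html) =~ s/<!--.*?-->//g;

# With /s, "." also matches newlines. The non-greedy .*? stops at the
# first "-->", which is the browser-style behaviour, not strict SGML.
(my $with_s = $html) =~ s/<!--.*?-->//gs;

print "without /s:\n$without_s\n";
print "with /s:\n$with_s\n";
```

The /s version handles this sample, but it still knows nothing about the even-number-of-"--" rule quoted above.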
by n4mation (Acolyte) on Aug 23, 2003 at 05:28 UTC | |
by chromatic (Archbishop) on Aug 23, 2003 at 05:44 UTC | |
It's easier because it's already written, it works, and you don't have to know how it works to use it. The problem with regular expressions and HTML is that HTML isn't required to be regular. It's not hard to write a simple regex that processes simple HTML, but it's easy to write realistic HTML that a regex won't process.
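As a hedged illustration of "realistic HTML that a regex won't process", consider text that only looks like a comment. The input below is invented for this sketch. A parser that tracks where it is in the document would be expected to leave the script string and the attribute value alone, while a plain substitution cannot tell the difference.

```perl
use strict;
use warnings;

# Made-up input: only the last line is a real comment. The text inside
# the script string and the attribute value merely looks like one.
my $html = <<'HTML';
<script>var s = "<!-- not a comment -->";</script>
<input value="<!-- also not a comment -->">
<!-- a real comment -->
HTML

# The substitution has no notion of context, so it strips all three.
(my $stripped = $html) =~ s/<!--.*?-->//gs;
print $stripped;
```

After the substitution the script string and the attribute value are both empty, which silently changes what the page does.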
by Ovid (Cardinal) on Aug 23, 2003 at 06:15 UTC | |
by allolex (Curate) on Aug 23, 2003 at 07:12 UTC | |
In addition to chromatic's remarks, it's also a good idea to learn how to use the HTML parser modules because you will almost certainly run into an application for them later. Using them to delete comments is one thing, but if you ever want to do anything more complex with HTML, you'll already be familiar with this tool. I recommend Ovid's very intuitive HTML::TokeParser::Simple. (I was in your situation not too long ago.) :)
|
Re: Removing html comments with regex
by blokhead (Monsignor) on Aug 23, 2003 at 03:28 UTC | |
"Please don't make me use the modules!"
You make it sound like they're haunted or something...
Update: Hey, your second regex worked for this data. Are you sure it doesn't work for you?
blokhead
|
Re: Removing html comments with regex
by Abigail-II (Bishop) on Aug 23, 2003 at 22:39 UTC | |
You, and several of the people replying, seem to have a wrong impression of what HTML comments are. HTML comments are not pieces of text delimited by <!-- and -->. It's more complicated. Zero or more comments may appear inside a comment declaration. The declaration is delimited by <! and >. Comments are delimited by --. There is optional white space after each comment. The following quote is from RFC 1866, the only RFC to ever define an HTML standard:
To include comments in an HTML document, use a comment declaration. A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored.
The SGML Handbook [1] has to say the following about comments:
The s rule is similar to \s* in Perl regular expressions. The rule SGML character lists valid characters. This means that a regexp to recognize HTML comments is something like:
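A minimal sketch of such a pattern, built directly from the RFC 1866 wording quoted above. The variable names and sample strings are illustrative assumptions, and it does not check the SGML character rule.

```perl
use strict;
use warnings;

# One comment: "--", then any text up to and including the next "--".
# (?:[^-]|-[^-])* matches text that never has two dashes in a row.
my $comment = qr/--(?:[^-]|-[^-])*--/;

# A comment declaration: "<!", zero or more comments, each optionally
# followed by white space, then ">".
my $declaration = qr/<!(?:$comment\s*)*>/;

# Two comments in one declaration, an empty declaration, and a stray "-->".
my $html = "a<!-- one -- -- two -->b<!>c<!-- x --> d --> e";
(my $stripped = $html) =~ s/$declaration//g;
print "$stripped\n";    # prints "abc d --> e"

# Under this rule the declaration below already ends at "-- >",
# so " b -->" is ordinary text, unlike the first-"-->" rule.
my $odd = "x<!-- a -- > b -->y";
(my $odd_stripped = $odd) =~ s/$declaration//g;
print "$odd_stripped\n";    # prints "x b -->y"
```

Since the pattern requires "--" or ">" straight after "<!", it leaves declarations such as a DOCTYPE untouched.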
Or you could simply do:
[1] Charles F. Goldfarb: The SGML Handbook. Oxford, Clarendon Press, 1990.
Abigail
|
Re: Removing html comments with regex
by Theo (Priest) on Aug 23, 2003 at 21:40 UTC | |
-ted-
by Abigail-II (Bishop) on Aug 23, 2003 at 23:09 UTC | |
Abigail
by Theo (Priest) on Aug 24, 2003 at 22:01 UTC | |
-ted-
|
Re: Removing html comments with regex
by Anonymous Monk on Aug 23, 2003 at 04:19 UTC | |
<!-- comments --> snippet of code <!-- comments -->
by esh (Pilgrim) on Aug 23, 2003 at 06:33 UTC | |
You have provided a regular expression and some sample data. You say that the regular expression removes "snippet of code" from your sample data, but it does not. Here, I've adapted blokhead's framework and plugged in your regular expression and your data:
The output is:
If you want to have someone debug your problem, you'll have to provide the real regular expression you're using or the real data. (And, yes, I did test it out with multi-line comments, too.)
BTW, $comments is a little strange as a name for text which includes more than just comments. Are you sure that it contains the snippet of code before you apply the regular expression?
-- Eric Hammond
|
Re: Removing html comments with regex
by n4mation (Acolyte) on Aug 24, 2003 at 02:04 UTC | |
I did not say the regex was removing the snippet of code. I said the comment tags were removing the snippet of code. If I have something like this:
<!-- BEGIN N4MATIONS POOP -->
Several lines of code in here
<!-- END N4MATIONS POOP -->
If I leave both the comment tags in, it stores nothing in the file. I can remove either the begin tag or the end tag manually, and the code will store in the file. I want to remove the comment tags with a regex, and I guess I'm missing something. I don't speak fluent Perl, so I don't understand much yet. I am using another regex for the same textarea, and it works fine. I appreciate the help though.
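One way text between two comments can vanish, offered only as a possibility since the actual regex is not shown here, is a greedy `.*`. The sketch below uses the sample from this post. The variable names are made up.

```perl
use strict;
use warnings;

# The sample from the post above, one item per line.
my $text = join "\n",
    "<!-- BEGIN N4MATIONS POOP -->",
    "Several lines of code in here",
    "<!-- END N4MATIONS POOP -->",
    "";

# Greedy: .* runs to the LAST "-->", so the code between the two
# comments is removed along with them.
(my $greedy = $text) =~ s/<!--.*-->//gs;

# Non-greedy: .*? stops at the FIRST "-->", so each comment is removed
# on its own and the code in between survives.
(my $lazy = $text) =~ s/<!--.*?-->//gs;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

If the stored file comes out empty only when both tags are present, checking for a greedy quantifier like this is a cheap first step.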
|
Re: Removing html comments with regex
by davido (Cardinal) on Aug 25, 2003 at 03:34 UTC | |
You will find the possibility of nested comments, of server-side includes (which look like comments), of comments with multiple --comment-- blocks, and who knows what else, that could foul up your regexp plan.
There is also good logical reason to use modules. You might spend five hours figuring out your regular expression, and it still won't work 100% of the time. A well-written module has many thousands of collective man-hours of work behind it, from not just the primary author but also the vast user base of that module within the Perl community. One person alone may or may not get something right. A large base of programmers, who put modules through hoops that the primary author may not have even considered in the first place, contribute their suggestions, comments, and bug alerts. They expose flaws, they find nits to pick, and in the end, the module emerges robust, reliable, and secure. This is an evolving process; nobody can say that, in a fast-moving infrastructure such as the Internet or computers in general, a module is ever finished. But it's light-years down the development path past the regexp that you might cook up in an evening with a slice or two of pizza.
The famous quote is applicable in this situation: "You can fool all of the people some of the time, and you can fool some of the people all of the time, but you cannot fool all of the people all of the time." A module has to stand up to the rigors of all of the people, all of the time. It has proven itself not just with some people sometimes, but all people (who use it) always. To get the kind of robustness that you can find in well-written and trusted modules, you would have to quit your day job for the next ten years.
One reason Perl exists is because we're all basically lazy. Perl's proponents are terribly lazy, and Perl helps them to support the lazy lifestyle. (Ok, many are also hard workers, but lazily so.) Modules support laziness too, which is a good thing. But refusing to learn to use modules, out of laziness, is false laziness, or misguided laziness, for that extra 10 minutes it takes to figure out how to use a module will save you countless hours down the road.
Another reason for using the module to solve a problem is that it is already the answer to your question. The module is a form of FAQ. It is the answer to a frequent need, rather than a frequent question. When someone comes and says, how do I accomplish this, people say, oh, use the module. Many people have put time and effort into the module. Some of them are in this forum. If you say, "But I don't want to use the module" (despite the fact that it is designed to be the answer to your problem), you are saying, "Thanks for changing my tire for me, but could you do it again, this time lifting the car by hand instead of using the jack?"
Don't resist what works best for the vast majority. Your case is not so different from that of everyone else.
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein