As we all know, the canonical example of what not to do with regular expressions is to parse HTML.
It always bugs me when I see people say this. It's one of those self-defeating generalizations that just confuses things, because people observe that, taken literally, it often isn't true.
If I have a static piece of HTML, especially one that is machine-generated and/or simply structured, I can easily munge it and extract what I need with a regex or two and a bit of logic. This will take far less time than using HTML::Parser or HTML::TokeParser or HTML::TreeBuilder or your tokenizer here.
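To make that concrete, here is a minimal sketch of the kind of one-off extraction I mean. It is in Python purely for illustration, and the report markup, the REPORT and ROW names and the extract_rows helper are all made up; it only works because the input is assumed to be machine-generated with exactly this fixed row layout.

```python
import re

# Hypothetical machine-generated report: every row has the same fixed markup.
REPORT = """\
<table>
<tr><td>alpha</td><td>12</td></tr>
<tr><td>beta</td><td>7</td></tr>
<tr><td>gamma</td><td>30</td></tr>
</table>
"""

# One regex per row: a text cell followed by a numeric cell.
# Assumes no attributes, no nested tags and no stray whitespace.
ROW = re.compile(r"<tr><td>([^<]*)</td><td>(\d+)</td></tr>")

def extract_rows(html):
    """Return (name, count) pairs from the fixed-format report."""
    return [(name, int(count)) for name, count in ROW.findall(html)]

rows = extract_rows(REPORT)
print(rows)
```

A dozen lines and no parser to install or learn, which is the whole appeal when the input is known and the job runs once.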
On the other hand, it is very difficult to parse any arbitrary page using the same approach. In fact, it is usually trivial to reverse-engineer a regex-based parser and construct an HTML snippet that will break it.
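A quick sketch of how easily that happens, again in Python for illustration only. The LINK pattern, the extract_links helper and both sample strings are hypothetical; the point is that each piece of the "hostile" input is perfectly ordinary HTML that the naive pattern either misses or wrongly matches.

```python
import re

# Naive link extractor: assumes href is the first attribute and is double-quoted.
LINK = re.compile(r'<a href="([^"]*)">')

def extract_links(html):
    """Return every href value the naive pattern happens to match."""
    return LINK.findall(html)

# Input written the way the regex author expected.
friendly = '<a href="/one">one</a> <a href="/two">two</a>'

# Ordinary HTML that defeats the pattern in three different ways.
hostile = (
    "<a href='/single-quoted'>missed</a> "           # single quotes: not matched
    '<!-- <a href="/commented-out">ghost</a> --> '   # inside a comment: matched anyway
    '<a class="x" href="/extra-attribute">missed too</a>'  # href not first: not matched
)

print(extract_links(friendly))
print(extract_links(hostile))
```

On the hostile input the extractor returns only the commented-out link, so it misses both real links and reports one that a browser would never show.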
Anyway, my point is that parsing arbitrary HTML is hard to do with regexes; however, on occasion a regex can be just the thing you need to rip the essential data out of some specific web page or HTML report. If you are only going to run the extractor once, then sometimes proper parsing is just too big a hammer to get out of the box. Accordingly, I'd prefer to see that line rephrased.
:-)
In reply to Re: How to use Regular Expressions with HTML
by Anonymous Monk
in thread How to use Regular Expressions with HTML
by Ovid