
As we all know, the canonical example of what not to do with regular expressions is to parse HTML.

It always bugs me when I see people say this. It's one of those self-defeating generalizations that just confuses things, because people observe that, taken literally, it often isn't true.

If I have a static piece of HTML, especially one that is machine generated and/or simply structured, I can easily munge it and extract what I need with a regex or two and a bit of logic. This will take far less time than using HTML::Parser or HTML::TokeParser or HTML::TreeBuilder or your tokenizer here.
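
To make that concrete, here is a minimal sketch. The report layout, the class names and the %size_of hash are all made up for illustration; the only assumption that matters is that every row has exactly this machine-generated shape.

```perl
use strict;
use warnings;

# Hypothetical machine-generated report: every row has the same fixed shape.
my $html = <<'HTML';
<table>
<tr><td class="name">alpha</td><td class="size">10</td></tr>
<tr><td class="name">beta</td><td class="size">25</td></tr>
</table>
HTML

# One regex, tied to this exact layout, pulls out the name/size pairs.
my %size_of;
while ( $html =~ m{<td class="name">([^<]*)</td><td class="size">(\d+)</td>}g ) {
    $size_of{$1} = $2;
}

print "$_ => $size_of{$_}\n" for sort keys %size_of;
```

This only works because the input is known in advance. It is a throwaway extractor for one specific report, not a parser.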

On the other hand, it is very difficult to parse any arbitrary page using the same approach. In fact, it is usually trivial to reverse engineer a regex-based parser and construct an HTML snippet that will break it.
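
A rough illustration of how easily that happens, assuming a hypothetical extractor ($name_re, invented for this example) that was tuned to a single cell layout. Each of the later inputs is ordinary HTML that the pattern either misses or misreads.

```perl
use strict;
use warnings;

# Hypothetical extractor tuned to one layout: <td class="name">...</td>
my $name_re = qr{<td class="name">([^<]*)</td>};

my @inputs = (
    '<td class="name">alpha</td>',             # the layout the regex was written for
    "<td class='name'>alpha</td>",             # single-quoted attribute
    '<td id="x" class="name">alpha</td>',      # an extra attribute
    '<!-- <td class="name">bogus</td> -->',    # markup that is commented out
);

my @got;
for my $input (@inputs) {
    # List-context match: $name stays undef when the pattern fails.
    my ($name) = $input =~ $name_re;
    push @got, $name;
    printf "%-40s => %s\n", $input, defined $name ? "'$name'" : 'no match';
}
```

The second and third inputs are missed entirely, and the fourth yields data from a comment that a real parser would skip.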

Anyway, my point is that parsing arbitrary HTML is hard to do with regexes; however, on occasion a regex can be just the thing you need to rip the essential data out of some specific web page or HTML report. If you are only going to run the extractor once, then sometimes proper parsing is just too big a hammer to get out of the box. Accordingly, I'd prefer to see that line rephrased.

:-)

In reply to Re: How to use Regular Expressions with HTML by Anonymous Monk
in thread How to use Regular Expressions with HTML by Ovid
