comment on

It might be useful to read up a bit on the theory of formal languages. You'll see that there's a whole family of languages, each described by a certain mathematical formalism. Regular languages are an example, and as you can guess they're described by regular expressions. Unfortunately, HTML is not a regular language and hence can not be described by regular expressions since they're just not powerful enough.

By way of example, consider <em>hello beautiful HTML <em>world</em></em>: easy to write a regular expression to get the inner "world", isn't it? Now consider <em>hello <em>beautiful<em>HTML world</em></em></em>, if you want to match something, again you can write a regular expression... as long as you know the maximum number of times the <em>...</em> tags will be embedded.

HTML allows unbounded nesting of tags, so this means that you can't write a general regular expression that describes every possible nesting situation. Regular expression are simply not powerful enough, you'll need at least context free languages, hence a tool such as HTML::Parser or for general cases something like Parse::RecDescent.

Now you can argue:

yeah right, but real world HTML is not that complicated, or
you can fiddle with embedded code and cuts in regular expressions.

As to the first argument: you don't always know this in advance if you don't control the HTML generation yourself, people are bound to do weird things, mostly not even on purpose.
As to the second argument: true, but these are still experimental features (as the docs specify for 5.6.1) and they're not at all obvious to use, even up to the point that it is easier to use a more powerful tool than get the particular regular expression right. (Note from a formal language theory point of view: embedded code, cuts and the like increase Perl "regular expressions" beyond regular languages.)

Given this story, your claim that one can deal with all problems HTML by using regular expressions shows some unwarranted optimism on your part. Obviously there's no reason to believe me, so I'll suggest a number of references on the subject:

Introduction to the theory of computation by M. Sipser, an excellent book that's a pleasure to read,
Computability, complexity and languages by M.D. Davis et al., definitely harder than Sipser's, but nice too and
Introduction to automata theory, languages, and computation by J.E. Hopcroft et al., a classic in the field.

And who knows, maybe our own mstone will write a MOPT on the subject one of these days? (Hint, hint ;-)

Just my 2 cents, -gjb-

Update: Thanks TheHobbit for reiterating the points I actually mention in my text if you bother to read it carefully. (?{...}) and /e are called code embedding.

In reply to Re: Re: A few random questions from Learning Perl 3 by gjb
in thread A few random questions from Learning Perl 3 by sulfericacid

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Welcome to the Monastery
	PerlMonks