in reply to Fetching meta tag info
I think it is worth adding to this thread a quick explanation of why regular expressions are a very risky way to process HTML (as well as XML).
HTML is a complex standard: the rules about what counts as a tag, which tags can be omitted and inferred, how whitespace is treated, and so on are all intricate.
Here is a list of valid meta tags that "naive" or even quite advanced regular expressions would fail to match (all are correctly parsed by HTML::TokeParser and, I guess, by HTML::TokeParser::Simple):
<meta  NAME="HTML.author" CONTENT="Joe Smith"> <!-- 2 spaces before NAME -->
<meta NAME=HTML.author CONTENT="Joe Smith"> <!-- no " around the value of NAME -->
<meta NAME="HTML.author" CONTENT='Joe Smith'> <!-- you can use ' instead of " -->
<meta CONTENT="Joe Smith" NAME="HTML.author"> <!-- the order of attributes is not significant -->
<meta NAME="HTML.author"
      CONTENT="Joe Smith"> <!-- tags can span across several lines -->
<meta NAME="HTML.author" CONTENT="Joe Smith" /> <!-- XHTML style -->
I am sure other examples could be found.
And of course you might grab things that look like a meta tag but are not, like:
<!-- <meta NAME="HTML.author" CONTENT="Joe Smith"> --> <!-- yep, a comment -->
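To make the point concrete, here is a minimal sketch of the comparison. Note that it uses Python's standard-library html.parser as a stand-in, not HTML::TokeParser, so it only illustrates the idea. The pattern NAIVE, the class MetaCollector and the helper parse_author are names made up for this example, and the naive pattern is just one plausible first attempt.

```python
import re
from html.parser import HTMLParser

# A deliberately naive pattern (hypothetical): fixed case, fixed attribute
# order, double quotes only, exactly one space between the parts.
NAIVE = re.compile(r'<meta NAME="HTML\.author" CONTENT="([^"]*)">')


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from real <meta> tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes.
        # XHTML-style <meta ... /> also ends up here, because the default
        # handle_startendtag() calls handle_starttag().
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes:
                self.found.append((attributes["name"], attributes.get("content")))


def parse_author(html):
    """Return the CONTENT of every HTML.author meta tag found by the parser."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return [content for name, content in parser.found if name == "HTML.author"]


# The valid variants listed in the post.
VARIANTS = [
    '<meta  NAME="HTML.author" CONTENT="Joe Smith">',
    '<meta NAME=HTML.author CONTENT="Joe Smith">',
    '<meta NAME="HTML.author" CONTENT=\'Joe Smith\'>',
    '<meta CONTENT="Joe Smith" NAME="HTML.author">',
    '<meta NAME="HTML.author"\n      CONTENT="Joe Smith">',
    '<meta NAME="HTML.author" CONTENT="Joe Smith" />',
]

# Looks like a meta tag, but is only a comment.
COMMENTED = '<!-- <meta NAME="HTML.author" CONTENT="Joe Smith"> -->'

for html in VARIANTS + [COMMENTED]:
    print(repr(html), "| regex:", NAIVE.findall(html), "| parser:", parse_author(html))
```

With this particular pattern the regex misses every valid variant and then matches the commented-out tag, while the parser reports the author for each real tag and nothing for the comment. A more elaborate pattern could cover some of these cases, but each fix tends to add complexity without ever covering the whole standard.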
Re: Re: Fetching meta tag info
by dingus (Friar) on Nov 13, 2002 at 08:21 UTC |