I think it is worth adding to this thread a quick explanation of why regular expressions are a very risky way to process HTML (as well as XML).
HTML is a very complex standard: the rules about what counts as a tag, how tags can be omitted and inferred, how whitespace is handled, etc. are really complex.
Here is a list of proper meta tags that "naive" or even quite advanced regular expressions would fail to match (all are correctly parsed by HTML::TokeParser and, I guess, by HTML::TokeParser::Simple):
<meta  NAME="HTML.author" CONTENT="Joe Smith">    <!-- 2 spaces before NAME -->
<meta NAME=HTML.author CONTENT="Joe Smith">       <!-- no " around the value of NAME -->
<meta NAME="HTML.author" CONTENT='Joe Smith'>     <!-- you can use ' instead of " -->
<meta CONTENT="Joe Smith" NAME="HTML.author">     <!-- the order of attributes is not significant -->
<meta NAME="HTML.author"
      CONTENT="Joe Smith">                        <!-- tags can span across several lines -->
<meta NAME="HTML.author" CONTENT="Joe Smith" />   <!-- XHTML style -->
I am sure other examples could be found.
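To make the point concrete, here is a minimal sketch that runs a deliberately naive regular expression and a real HTML parser over the variants above, plus a commented-out tag. It is written in Python with the standard library's html.parser purely as a stand-in for HTML::TokeParser; the NAIVE pattern, the MetaAuthor class and the helper names are my own illustrative assumptions, not anything from the original thread.

```python
import re
from html.parser import HTMLParser

# The tag variants discussed above, plus a commented-out tag as the last entry.
SAMPLES = [
    '<meta  NAME="HTML.author" CONTENT="Joe Smith">',      # 2 spaces before NAME
    '<meta NAME=HTML.author CONTENT="Joe Smith">',         # unquoted value
    '<meta NAME="HTML.author" CONTENT=\'Joe Smith\'>',     # single quotes
    '<meta CONTENT="Joe Smith" NAME="HTML.author">',       # attributes reordered
    '<meta NAME="HTML.author"\n      CONTENT="Joe Smith">',  # spans two lines
    '<meta NAME="HTML.author" CONTENT="Joe Smith" />',     # XHTML style
    '<!-- <meta NAME="HTML.author" CONTENT="Joe Smith"> -->',  # only a comment
]

# Hypothetical "naive" pattern: it hard-codes spacing, quoting and attribute order.
NAIVE = re.compile(r'<meta NAME="HTML\.author" CONTENT="([^"]*)">')


class MetaAuthor(HTMLParser):
    """Collects the content attribute of every <meta name="HTML.author"> tag."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names; values keep their case.
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if tag == "meta" and name.lower() == "html.author":
            self.authors.append(attributes.get("content"))


def naive_authors(html):
    """Authors found by the naive regular expression."""
    return NAIVE.findall(html)


def parse_authors(html):
    """Authors found by the HTML parser."""
    parser = MetaAuthor()
    parser.feed(html)
    parser.close()
    return parser.authors


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample))
        print("  regex :", naive_authors(sample))
        print("  parser:", parse_authors(sample))
```

With this particular pattern, the regex misses all six real tags and matches only the commented-out one, while the parser reports the author for the six real tags and nothing for the comment. A more elaborate regex can cover more of these cases, but each added case makes it harder to read and to trust.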
And of course you might grab things that look like a meta tag but are not, like:
<!-- <meta NAME="HTML.author" CONTENT="Joe Smith"> -->   <!-- yep, a comment -->

In reply to Re: Fetching meta tag info
by mirod
in thread Fetching meta tag info
by Anonymous Monk