in reply to a regex to parse html tags

First, use the HTML::Parser module or HTML::TokeParser, or in your case, probably the HTML::HeadParser module to parse the HTML. Regular expressions won't work reliably. What if you have an HTML document like this:
<!-- I changed this, it was just <head><title></title></head> - djb (03 Jul 2001) -->
<head>
<title>Blah</title>
<meta name="DESCRIPTION" value="About </head> tags.">
</head>
That's a whole lot harder to parse with a regular expression.
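
To make the failure concrete, here is a minimal sketch that runs a naive pattern over that sample document. The pattern and the variable names are my own assumptions for illustration, not the original poster's code.

```perl
use strict;
use warnings;

# The sample document from above, held in one string.
my $html = <<'HTML';
<!-- I changed this, it was just <head><title></title></head> - djb (03 Jul 2001) -->
<head>
<title>Blah</title>
<meta name="DESCRIPTION" value="About </head> tags.">
</head>
HTML

# Naive attempt: capture whatever sits between <head> and </head>.
# The first <head> in the string is the one inside the comment.
my ($naive_head) = $html =~ m{<head>(.*?)</head>}s;

# Stripping comments first does not rescue it: the match now stops
# at the "</head>" inside the meta tag's attribute value.
(my $no_comments = $html) =~ s/<!--.*?-->//gs;
my ($truncated_head) = $no_comments =~ m{<head>(.*?)</head>}s;

print "naive:     [$naive_head]\n";
print "truncated: [$truncated_head]\n";
```

The first capture is the commented-out markup, and the second is cut off in the middle of the meta tag. A real parser tracks comments and quoted attribute values, which is why it does not trip over either case.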

That said, [.\n] creates a character class matching a literal period or a newline. Inside []s, a . is not special. You could use the /s modifier (see perlre) and just use .+ instead. The /s modifier makes . match newlines too. Another way is to use (?:.|\n), which is the same as (.|\n) except that it doesn't capture anything (into the $<digit> variables).
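
A short sketch of the three variants side by side; the sample strings and variable names are assumptions made up for the illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# [.\n] is a character class: a literal period or a newline, nothing else.
my $class_on_text = $text  =~ /\A[.\n]+\z/ ? 1 : 0;   # letters are not in the class
my $class_on_dots = ".\n." =~ /\A[.\n]+\z/ ? 1 : 0;   # only periods and newlines here

# Without /s, "." refuses to match the newline.
my $plain_dot = $text =~ /\A.+\z/ ? 1 : 0;

# With /s, "." matches the newline as well.
my $dot_with_s = $text =~ /\A.+\z/s ? 1 : 0;

# Non-capturing alternation: crosses newlines without /s and captures nothing.
my $alternation = $text =~ /\A(?:.|\n)+\z/ ? 1 : 0;

print "class on text: $class_on_text, class on dots: $class_on_dots\n";
print "plain dot: $plain_dot, dot with /s: $dot_with_s, alternation: $alternation\n";
```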

Also, you need to escape / in regexes if you are using / as the delimiter, like this: \/. You can avoid that ugliness by using an alternate delimiter (like m!regex goes here! or m(regex)). I assume the lack of a / at the end of your regex is an error made in posting your code here.
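
For example, all of the following match the same closing tag; only the first needs the backslash. The sample string is an assumption for the illustration.

```perl
use strict;
use warnings;

my $line = 'some text </head> more text';

# Default delimiter: every "/" inside the pattern must be escaped.
my $escaped = $line =~ /<\/head>/ ? 1 : 0;

# Alternate delimiters: the "/" needs no escaping.
my $bang   = $line =~ m!</head>! ? 1 : 0;
my $braces = $line =~ m{</head>} ? 1 : 0;
my $parens = $line =~ m(</head>) ? 1 : 0;

print "escaped: $escaped, bang: $bang, braces: $braces, parens: $parens\n";
```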

update: To give another example of why not to use a regex: <head something="someattribute">...</head> won't be handled by a simple regex either.
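
A quick sketch of that case, with patterns that are my own guesses at what a "simple regex" would look like. Loosening the pattern to allow attributes helps until an attribute value contains a > of its own.

```perl
use strict;
use warnings;

my $tag = '<head something="someattribute">...</head>';

# A literal <head> does not match once the tag carries an attribute.
my $literal = $tag =~ m{<head>} ? 1 : 0;

# Allowing "anything but >" after the tag name handles this input...
my $loose = $tag =~ m{<head\b[^>]*>} ? 1 : 0;

# ...but a ">" inside a quoted attribute value ends the match too early.
my $tricky = '<head title="a > b"><title>x</title></head>';
my ($open_tag) = $tricky =~ m{(<head\b[^>]*>)};

print "literal: $literal, loose: $loose\n";
print "open tag seen by the loose pattern: [$open_tag]\n";
```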

update 2: fixed typo of </head> where </title> was meant and other minor typos.

Re: Re: a regex to parse html tags
by Hofmator (Curate) on Jul 05, 2001 at 12:52 UTC

    Just one small addition to the problem of matching any character. The /s modifier is definitely the way to go, so that . matches everything including newlines.

    If for some reason you don't want to use the /s modifier - maybe you have other dots in your regex which should not match newlines - you should use a character class. The advantage over (?:.|\n) is that a character class is a single step for the regex engine, so it avoids the alternation and the backtracking that comes with it.

    # character class matching any one character
    /[\000-\377]/
    # or equivalent
    /[\d\D]/
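
    A small sketch of the mixed case described above, where one part of the pattern should stay on a single line and another part may cross newlines. The sample string and variable names are assumptions. One caveat worth knowing: [\000-\377] only covers code points 0 through 255, so the two classes differ on strings that hold wider characters.

    ```perl
    use strict;
    use warnings;

    my $text = "one\ntwo";

    # [\d\D] matches any single character, newline included, without /s.
    my $any_class = $text =~ /\A[\d\D]+\z/ ? 1 : 0;

    # No /s here, so the plain "." stays on the first line while
    # [\d\D]+ swallows the newline and everything after it.
    my ($first_line) = $text =~ /\A(.+)[\d\D]+\z/;

    # A character above code point 255 is matched by [\d\D] only.
    my $wide        = "\x{263A}";
    my $wide_any    = $wide =~ /\A[\d\D]\z/      ? 1 : 0;
    my $wide_octal  = $wide =~ /\A[\000-\377]\z/ ? 1 : 0;

    print "any class: $any_class, first line: $first_line\n";
    print "wide char via [\\d\\D]: $wide_any, via [\\000-\\377]: $wide_octal\n";
    ```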

    -- Hofmator