Re: a regex to parse html tags

First, use the HTML::Parser module or HTML::TokeParser, or in your case probably HTML::HeadParser, to parse the HTML. Regular expressions won't handle this reliably. What if you have an HTML document like this:

<!--
I changed this, it was just <head><title></title></head>
 - djb (03 Jul 2001)
-->
<head>
<title>Blah</title>
<meta name="DESCRIPTION" value="About </head> tags.">
</head>

That's a whole lot harder to parse with a regular expression.
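To make that concrete, here is a rough sketch of what two plausible regex attempts do with the sample above. The patterns are my own guesses at a "simple" approach, not the original poster's code, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# The sample document from the post.
my $html = <<'HTML';
<!--
I changed this, it was just <head><title></title></head>
 - djb (03 Jul 2001)
-->
<head>
<title>Blah</title>
<meta name="DESCRIPTION" value="About </head> tags.">
</head>
HTML

# Naive attempt: the first <title> in the text is the one inside the comment,
# so the capture is the empty string rather than "Blah".
my ($naive_title) = $html =~ m{<title>(.*?)</title>}s;

# Second attempt: strip comments first, then grab the head section.
(my $no_comments = $html) =~ s/<!--.*?-->//gs;
my ($head) = $no_comments =~ m{<head>(.*?)</head>}s;

# The match stops at the "</head>" inside the attribute value,
# so the captured head ends in the middle of the <meta> tag.
my $head_is_truncated = $head =~ /About $/ ? 1 : 0;

print "naive title: '$naive_title'\n";
print "head truncated inside the meta tag: $head_is_truncated\n";
```

Each fix you bolt on (comments, quoted attribute values, and so on) is a piece of an HTML parser you end up writing by hand.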

That said, [.\n] creates a character class matching a literal period or a newline. Inside []s a . is not special. You could use the /s modifier (see perlre) and just use .+ instead; /s makes . match newlines too. Another way is (?:.|\n), which is the same as (.|\n) except that it doesn't capture anything (into the $<digit> variables).
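A quick way to see the difference between these variants is to run them against a title that spans two lines. This is only an illustrative sketch; the test string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "<title>two\nlines</title>";

# Inside [], the dot is literal: this only matches runs of periods and newlines.
my $class_matches = $text =~ m{<title>([.\n]+)</title>} ? 1 : 0;

# Without /s, the dot refuses to cross the newline.
my $plain_matches = $text =~ m{<title>(.+)</title>} ? 1 : 0;

# With /s, the dot matches the newline as well.
my ($with_s) = $text =~ m{<title>(.+)</title>}s;

# Non-capturing alternation: the inner group sets no $<digit> variable,
# only the outer parentheses capture.
my ($with_alt) = $text =~ m{<title>((?:.|\n)+)</title>};

print "[.\\n]+ matched: $class_matches\n";
print ".+ without /s matched: $plain_matches\n";
print "/s and (?:.|\\n) captured the same text\n" if $with_s eq $with_alt;
```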

Also, if you use / as the delimiter you need to escape any / in the regex, like \/, or you can avoid that ugliness by using an alternate delimiter (like m!regex goes here! or m(regex)). I assume the lack of a / at the end of your regex is an error made in posting your code here.
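For comparison, here are the three spellings side by side. All of them match the same thing; the sample string is just something made up to test against.

```perl
use strict;
use warnings;

my $html = "<head><title>Blah</title></head>";

# Default delimiter: every / in the pattern has to be escaped.
my $escaped = $html =~ /<\/head>/ ? 1 : 0;

# Alternate delimiters: the / needs no escaping.
my $bang  = $html =~ m!</head>! ? 1 : 0;
my $paren = $html =~ m(</head>) ? 1 : 0;

print "escaped: $escaped, m!!: $bang, m(): $paren\n";
```

Pick a delimiter that doesn't appear in the pattern itself, otherwise you are back to escaping.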

update: To give another example of why not to use a regex: <head something="someattribute">...</head> won't be handled by a simple regex either.
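Here is a small sketch of that last case, assuming HTML::HeadParser is installed (it ships with the HTML-Parser distribution on CPAN, not with core Perl). The "simple" pattern is my guess at what such a regex would look like; the parser calls follow the module's documented new/parse/header interface.

```perl
use strict;
use warnings;
use HTML::HeadParser;   # from CPAN (HTML-Parser distribution), not core Perl

my $html = '<head something="someattribute"><title>Blah</title></head>';

# A pattern that expects a bare <head> tag finds nothing here.
my $simple_matches = $html =~ m{<head>(.*?)</head>}s ? 1 : 0;

# The parser doesn't care about extra attributes on <head>.
my $parser = HTML::HeadParser->new;
$parser->parse($html);
my $title = $parser->header('Title');

print "simple regex matched: $simple_matches\n";
print "title from parser: $title\n";
```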

update 2: fixed typo of </head> where </title> was meant and other minor typos.


Replies are listed 'Best First'.

Re: Re: a regex to parse html tags
by Hofmator (Curate) on Jul 05, 2001 at 12:52 UTC

Just one small addition to the problem of matching any character. The /s modifier is definitely the way to go, so that . matches everything including newlines.

If for some reason you don't want to use the /s modifier, maybe because you have other dots in your regex which should not match newlines, you should use a character class. The advantage over (?:.|\n) is that a character class has no alternation for the engine to backtrack through.

# character class matching any one character (code points 0 to 255)
/[\000-\377]/

# or equivalent for such characters (this one also matches anything above \377)
/[\d\D]/
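As a rough illustration of when this helps: below, the ordinary dots must stay on one line while a single "any character" slot is allowed to be a newline. The test strings and variable names are invented for the example. The last two checks show one caveat worth knowing on Unicode-aware Perls: the octal range stops at code point 255, while [\d\D] does not.

```perl
use strict;
use warnings;

my $text = "a.b\nc.d";

# A plain dot in the middle slot can't match the newline.
my $dot_matches = $text =~ /a.b(.)c.d/ ? 1 : 0;

# Either character class can, while the other dots keep their usual meaning.
my ($octal_gap) = $text =~ /a.b([\000-\377])c.d/;
my ($dd_gap)    = $text =~ /a.b([\d\D])c.d/;

# Caveat: a character above \377 falls outside the octal range.
my $wide       = "\x{263A}";
my $octal_wide = $wide =~ /^[\000-\377]$/ ? 1 : 0;
my $dd_wide    = $wide =~ /^[\d\D]$/      ? 1 : 0;

print "plain dot matched: $dot_matches\n";
print "both classes captured the newline\n"
    if $octal_gap eq "\n" && $dd_gap eq "\n";
print "wide char: octal range $octal_wide, [\\d\\D] $dd_wide\n";
```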

-- Hofmator
