a regex to parse html tags

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: a regex to parse html tags by wog (Curate) on Jul 05, 2001 at 06:54 UTC
First, use the HTML::Parser module or HTML::TokeParser (or, in your case, probably the HTML::HeadParser module) to parse the HTML. Regular expressions won't work reliably. What if you have an HTML document like this: `<!-- I changed this, it was just <head><title></title></head> - djb (03 Jul 2001) --> <head> <title>Blah</title> <meta name="DESCRIPTION" value="About </head> tags."> </head>` That's a whole lot harder to parse with a regular expression.
That said, `[.\n]` creates a character class matching a literal period or a newline, because `.` is not special inside `[]`. You could use the `/s` modifier (see perlre) and just use `.+` instead. The `/s` modifier makes `.` match newlines as well. Another way is to use `(?:.|\n)`, which is the same as `(.|\n)` except that it doesn't capture anything (into the `$<digit>` variables).
Also, you need to escape `/` in regexes if you are using `/` as the delimiter, like this: `\/`. You can avoid that ugliness by using an alternate delimiter (like `m!regex goes here!` or `m(regex)`). I assume the lack of a `/` at the end of your regex is an error made in posting your code here.
update: To give another example of why not to use a regex: `<head something="someattribute">...</head>` won't be handled by a simple regex either.
update 2: fixed typo of `</head>` where `</title>` was meant and other minor typos.
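To make the regex points above concrete, here is a minimal sketch using only core Perl. The sample string and the variable names (`$html`, `$class_hit`, `$with_s`, `$with_alt`) are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input: a head section spread over several lines.
my $html = "<head>\n<title>Blah</title>\n</head>";

# [.\n] is a class of two literal characters (period, newline),
# so it cannot get past the "<" of <title> and the match fails.
my $class_hit = $html =~ /<head>([.\n]+)<\/head>/ ? 1 : 0;

# With /s, "." matches newlines too.
my ($with_s) = $html =~ /<head>(.+)<\/head>/s;

# (?:.|\n) does the same without /s. The m!...! delimiter
# means the "/" in </head> needs no backslash.
my ($with_alt) = $html =~ m!<head>((?:.|\n)+)</head>!;

print "character class matched: ", ($class_hit ? "yes" : "no"), "\n";
print "/s and (?:.|\\n) agree: ", ($with_s eq $with_alt ? "yes" : "no"), "\n";
```

None of this makes the regex approach safe against the commented-out `<head>` shown above; it only illustrates how `.`, `/s`, and the delimiters behave.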
Re: Re: a regex to parse html tags by Hofmator (Curate) on Jul 05, 2001 at 12:52 UTC
Just one small addition to the problem of matching any character. The `/s` modifier is definitely the way to go, so that `.` matches everything including newlines. If for some reason you don't want to use the `/s` modifier (maybe you have other dots in your regex which should not match newlines), you should use a character class. The advantage over `(?:.|\n)` is that no backtracking has to be done. A character class matching any one character is `/[\000-\377]/` or, equivalently, `/[\d\D]/`. -- Hofmator
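A small sketch of the case Hofmator describes, where one dot should stop at a newline and another part of the pattern should not. The input string and variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input: a first line followed by a multi-line remainder.
my $text = "one\ntwo\nthree";

# Without /s the "." stays on the first line, while [\d\D]
# matches any character at all, newlines included.
my ($first, $rest) = $text =~ /^(.+)\n([\d\D]+)\z/;

# With /s every "." may cross newlines, so the greedy first
# group swallows two lines before the pattern can finish.
my ($first_s, $rest_s) = $text =~ /^(.+)\n(.+)\z/s;

print "class version: [$first] then ", length($rest), " more chars\n";
print "/s version:    first group holds ", length($first_s), " chars\n";
```

Here the character class lets you choose, per position, whether newlines are allowed, which `/s` cannot do because it changes every dot in the pattern.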
Re: a regex to parse html tags by Anonymous Monk on Jul 05, 2001 at 06:56 UTC
Use the `/s` modifier. Here are two that usually work: `if ($stuff =~ /<head>(.*?)<\/head>/s) { $what_i_want = $1; }` and `if ($stuff =~ /<head>([^<]+)<\/head>/s) { $what_i_want = $1; }`
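As a rough check of where "usually" ends, the sketch below runs both patterns over three made-up inputs, the last one modelled on wog's commented-out `<head>` example. The helper names `lazy` and `negated` and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns from the post.
sub lazy    { my ($s) = @_; return $s =~ /<head>(.*?)<\/head>/s   ? $1 : undef }
sub negated { my ($s) = @_; return $s =~ /<head>([^<]+)<\/head>/s ? $1 : undef }

# Made-up inputs: plain text, a nested tag, and a commented-out head.
my $simple = "<head>\nSome text\n</head>";
my $nested = "<head>\n<title>Blah</title>\n</head>";
my $tricky = "<!-- was <head><title></title></head> -->\n" . $nested;

my $lazy_simple    = lazy($simple);      # text between the tags
my $negated_simple = negated($simple);   # same text
my $negated_nested = negated($nested);   # undef: [^<]+ stops at <title>
my $lazy_tricky    = lazy($tricky);      # taken from inside the comment

print "negated on nested tags: ", (defined $negated_nested ? "match" : "no match"), "\n";
print "lazy on tricky input:   $lazy_tricky\n";
```

The non-greedy pattern copes with nested tags but picks up the first `<head>` it sees, even one inside a comment; the negated class only works when nothing between the tags contains a `<`.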
Re: a regex to parse html tags by elusion (Curate) on Jul 05, 2001 at 06:36 UTC
It would seem you're missing the closing slash on your regex. - p u n k k i d "Reality is merely an illusion, albeit a very persistent one." -Albert Einstein
Re: a regex to parse html tags by dsb (Chaplain) on Jul 05, 2001 at 23:32 UTC
You could perhaps try something with a negated character class: `print $1, "\n" if $var =~ m%<HEAD>([^<]+)</HEAD>%i;` The `^` at the beginning of the character class creates a negated character class, which essentially means: "match anything that is not `<`". Note that this only matches when there are no other tags between `<HEAD>` and `</HEAD>`. You could also use HTML::Parser and the like; however, if you are only trying to match the `<HEAD>` tags then using the module might be unnecessary. Amel - f.k.a. - kel
Re: a regex to parse html tags by sierrathedog04 (Hermit) on Jul 05, 2001 at 14:48 UTC
Add `/gs` to the end of your regex, e.g. `s/hello/goodbye/gs`. `g` performs the match repeatedly instead of just once (you said you wanted everything). `s` makes `.` match newlines, so a single match can span several lines.
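A brief sketch of what the two flags change, using a match in list context rather than a substitution. The `<p>` sample string and the array names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical input: two paragraphs, the first containing a newline.
my $html = "<p>one\ntwo</p><p>three</p>";

# /g in list context returns every capture; /s lets "." cross the newline.
my @with_gs = $html =~ /<p>(.*?)<\/p>/gs;

# Without /s the first paragraph cannot match, so only the second is found.
my @with_g = $html =~ /<p>(.*?)<\/p>/g;

print "with /gs: ", scalar(@with_gs), " matches\n";
print "with /g:  ", scalar(@with_g), " matches\n";
```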
Re: a regex to parse html tags by Anonymous Monk on Jul 05, 2001 at 06:53 UTC
whoops! assume the closing slash is there