Reg Expressions

mikeblatter has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Reg Expressions by silent11 (Vicar) on Feb 01, 2003 at 01:08 UTC
Look into the HTML::TokeParser module... -Silent11
Re: Reg Expressions by BrowserUk (Patriarch) on Feb 01, 2003 at 01:41 UTC
If, and only if, your data contains only the meta tag, and you are not attempting to extract it from a larger set of HTML data, then this should work for you: `$string = '<meta name="description" content="..."/>'; print $1 if $string =~ m[^<meta .*?name\s*=\s*"([^"]+)".*?/>$]i;` This prints `description`. It will only work if `$string` contains only the meta tag and nothing else. Next comes the question of how you would isolate the meta tag from a larger body of HTML. The answer is that you would almost certainly need to use one of the HTML::* modules, at which point the above becomes redundant, as they will allow you to get at the attributes of the meta tags (and every other tag) without needing a regex.
Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
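A minimal sketch of why the anchoring matters, assuming the pattern above was meant with `*` quantifiers (`.*?` and `\s*`). The helper name `meta_name_anchored` and the sample strings are made up for illustration and are not from the thread. The same pattern is run once against a bare tag and once against the tag embedded in a page:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): applies the anchored pattern and
# returns the value of the name attribute, or undef when the pattern fails.
sub meta_name_anchored {
    my ($string) = @_;
    return $string =~ m[^<meta .*?name\s*=\s*"([^"]+)".*?/>$]i ? $1 : undef;
}

# Illustrative inputs: the tag on its own, then the same tag inside a page.
my $tag_only = '<meta name="description" content="sample text"/>';
my $in_page  = '<html><head>' . $tag_only . '</head></html>';

for my $input ($tag_only, $in_page) {
    my $name = meta_name_anchored($input);
    print defined $name ? "matched: $name\n" : "no match\n";
}
```

Because of the `^` and `$` anchors, the second input should not match at all, which is the limitation described above.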
Re: Reg Expressions by Anonymous Monk on Feb 01, 2003 at 20:55 UTC
Is it something like this? `$html = "<html><meta name=\"description\" content=\"sf\"></html>"; $html =~ /<meta.+?name\s*=\s*("|')description\1.+?content\s*=\s*("|')(.*?)\2/; print $3;`
Re: Reg Expressions by cLive ;-) (Prior) on Feb 01, 2003 at 02:03 UTC
Dirty, but if you know the name always comes before the content, and that all meta tags contain a name and content, you can use: `/<meta.+?name\s*=\s*("|')description\1.+?content\s*=\s*("|')(.*?)\2/; my $description_content = $3;` .02 cLive ;-)
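If the attribute order is not known, one possible workaround that still uses plain regexes is to isolate each meta tag first and then pull `name` and `content` out of it separately. This is only a sketch: `meta_content` and the sample page are hypothetical, and it still breaks on input such as a `>` inside an attribute value, which is where a real parser earns its keep.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the content attribute of
# the first <meta> tag whose name attribute equals $wanted, or undef if none.
sub meta_content {
    my ($html, $wanted) = @_;
    # Step 1: walk over every <meta ...> tag and capture its attribute text.
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        # Step 2: extract each attribute on its own, so order does not matter.
        # The first capture is the quote character, the second is the value.
        my (undef, $name)    = $attrs =~ /\bname\s*=\s*(["'])(.*?)\1/is;
        my (undef, $content) = $attrs =~ /\bcontent\s*=\s*(["'])(.*?)\1/is;
        return $content if defined $name && lc $name eq lc $wanted;
    }
    return;
}

# Illustrative page with content placed before name.
my $page = <<'HTML';
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta content="Hello HTML Parsing!" name="description" />
</head>
HTML

my $found = meta_content($page, 'description');
print defined $found ? "description: $found\n" : "no description found\n";
```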
Re: Reg Expressions by Anonymous Monk on Feb 01, 2003 at 01:24 UTC
Yeah, I saw that. I was wondering if you guys know a regular expression for getting the description from the meta tag.
(jeffa) Re: Reg Expressions by jeffa (Bishop) on Feb 02, 2003 at 00:41 UTC
There have been plenty of working regexes posted in this thread; I thought there should be at least one reply that uses a parser. I'll assume that when you say you want the description, you really want the content from the meta tag.

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $description;
    my $p = HTML::TokeParser::Simple->new(*DATA);

    while (my $token = $p->get_token) {
        if ($token->is_start_tag('meta')) {
            my $attr = $token->return_attr;
            # Check the name itself, so a different named meta tag
            # appearing first is not mistaken for the description.
            if (defined $attr->{name} and lc $attr->{name} eq 'description') {
                $description = $attr->{content};
                last;
            }
        }
    }

    print "TokeParser got '$description'\n";

    __DATA__
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta name="description" content="Hello HTML Parsing!" />
    <meta name="keywords" content="make me the top hit!" />
    <meta name="generator" content="Perl, baby. Perl." />
    </head>
    <body>
    Hello World
    </body>
    </html>

jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)