Regular Expression

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regular Expression by waswas-fng (Curate) on Jun 28, 2005 at 19:06 UTC
Try using HTML::TokeParser::Simple or HTML::TokeParser for parsing HTML. It will save you a lot of headaches when you run into random issues with any complex regex you cook up for this task. -Waswas
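For reference, a minimal sketch of the parser-based approach suggested here, assuming HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed. The sample document and the `body_attributes` helper are made up for illustration and are not from the original post:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the attributes of the first real <body>
# start tag as a hash ref, or nothing if no such tag is found.
sub body_attributes {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    # get_tag skips text and comment tokens and returns
    # [$tag, \%attr, \@attrseq, $text] for a start tag.
    my $tag = $parser->get_tag('body') or return;
    return $tag->[1];
}

# Sample input with a commented-out tag and a > inside an attribute value.
my $html = <<'HTML';
<html><head><title>t</title></head>
<!-- <body class="old"> -->
<body class="new" onload="if (a > b) { go() }">
<p>hi</p></body></html>
HTML

my $attrs = body_attributes($html);
print "$_ = $attrs->{$_}\n" for sort keys %$attrs;
```

The parser tokenizes comments and quoted attribute values itself, so the two cases that trip up the regexes in this thread need no special handling.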
Re: Regular Expression by Transient (Hermit) on Jun 28, 2005 at 19:01 UTC
HTML::TreeBuilder (man, I'm lovin' these one-line answers ;) )
Re^2: Regular Expression by BUU (Prior) on Jun 28, 2005 at 19:04 UTC
HTML::TokeParser::Simple (My way is the one true way!)
Re^3: Regular Expression by l.frankline (Hermit) on Jun 29, 2005 at 13:51 UTC
good
Re: Regular Expression by davidrw (Prior) on Jun 28, 2005 at 19:23 UTC
See this node: Regular Expressions for almost the exact same question. Why can't you use modules? The most robust way will be something like HTML::Parser. Look specifically at the examples section for extracting the `<title>` tag. For a one-time quick and dirty job, use a regex (this assumes, of course, that there isn't a `>` in the onLoad JavaScript): `if( $html =~ /<body (.*?)>/si ){ my $body_attributes = $1; }` Maybe something like this will help guard against JavaScript screwing up the match, but it assumes ~~proper quoting of the attributes~~: `/<body((?:\s+(?:\w+=".*?"))*)>/si` Update: added strike and bold after reading/noting ikegami's response.
Re^2: Regular Expression by ikegami (Patriarch) on Jun 28, 2005 at 21:38 UTC
I was directed to the second regexp in this post as a solution that fixes problems in another post, but it's no better. "but assumes proper quoting of the attributes": The HTML spec allows for single quotes, and even allows for the quotes to be omitted in some circumstances, so no, it doesn't assume proper quoting. Also, it doesn't handle `>` inside of quotes (where it doesn't need to be escaped). Finally, it could locate `<body>` inside of a comment or inside of another attribute.
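To make the quoting point concrete, here is a rough sketch of a pattern that accepts double-quoted, single-quoted, and unquoted attribute values, and tolerates `>` inside quotes. The names `$attr` and `$body_re` and the sample tag are illustrative assumptions. It still cannot tell a real tag from one inside a comment, so it is not a substitute for a parser:

```perl
use strict;
use warnings;

# One attribute: a name, optionally followed by = and a value.
my $attr = qr{
    \s+ [^\s=>/"']+          # attribute name
    (?: \s* = \s*
        (?: "[^"]*"          # double-quoted value, may contain >
          | '[^']*'          # single-quoted value, may contain >
          | [^\s"'>]+        # unquoted value
        )
    )?
}x;

# Opening body tag; the first capture holds the raw attribute text.
my $body_re = qr{<body ( (?:$attr)* ) \s* /? >}xi;

my $html = q{<BODY onload="if (a > b) { go() }" class='x' id=main>};
my ($attributes) = $html =~ $body_re;
print "attributes:$attributes\n";
```

Because each attribute must start with whitespace, the pattern does not match a tag such as `<bodyguard>`.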
Re: Regular Expression by fmerges (Chaplain) on Jun 28, 2005 at 21:08 UTC
Hi, I use this, for example, in Mason component filters: `$html =~ s{(<body.*?)>}{$1 onLoad="window.print()">}is;` Hope it can be helpful. Regards, :-)
Re: Regular Expression by kwaping (Priest) on Jun 28, 2005 at 19:18 UTC
`my $body = $the_html_string; $body =~ s/(<body.*?>)/$1/sgi;`
Re^2: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:24 UTC
What will happen when a `<body>` tag is included in comments? Your regex will break. I'm almost sure that the one who gave the instruction to find the `<body>` of the HTML code with a simple regex did not think about this. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re^3: Regular Expression by kwaping (Priest) on Jun 28, 2005 at 19:29 UTC
Interesting observation. Do you often see body tags enclosed in comments? You are assuming the poster doesn't want body tags enclosed in comments. ;) In any case, I think this pattern is better (added ^): `my $body = $the_html_string; $body =~ s/^.*?(<body.*?>)/$1/sgi;`
Re^4: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:33 UTC
Re^4: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:48 UTC
Re^2: Regular Expression by Anonymous Monk on Jun 28, 2005 at 19:30 UTC
That worked. Thanks!
Re: Regular Expression by l.frankline (Hermit) on Jun 29, 2005 at 14:55 UTC
You can try it like this: `$_ =~ /<body[^>]*>/;` Frank
Re: Regular Expression by Anonymous Monk on Jun 28, 2005 at 19:14 UTC
Unfortunately, for my purpose I cannot use that module :-) I need to use a strict regex.
Re^2: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:21 UTC
Homework? Or some fundamental religious rule which forbids you to handle HTML other than with a regex? CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re^3: Regular Expression by waswas-fng (Curate) on Jun 28, 2005 at 19:22 UTC
ROTFL. -Waswas
Re^2: Regular Expression by Transient (Hermit) on Jun 28, 2005 at 19:17 UTC
<flame suit on> `$html =~ /(<body[^>]+>)/s;` `$html =~ /(<body[^>]*>)/s;` That should get you the opening tag. Update: Changed `+` to `*` as pointed out by kwaping.
Re^3: Regular Expression by ikegami (Patriarch) on Jun 28, 2005 at 19:35 UTC
That'll fail for `<body onload="if (a > b) { ... }">`. It can also fail if `<body>` is found in comments.
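Both failure modes are easy to reproduce. The snippet below is a small demonstration using the `[^>]*` pattern from the parent post; the two sample strings are invented for illustration:

```perl
use strict;
use warnings;

# The pattern under discussion.
my $re = qr/(<body[^>]*>)/s;

# Case 1: a > inside a quoted attribute value ends the match early.
my ($truncated) = '<body onload="if (a > b) { go() }">' =~ $re;

# Case 2: a commented-out tag is found before the real one.
my ($wrong) = '<!-- <body class="old"> --><body class="new">' =~ $re;

print "truncated: $truncated\n";   # <body onload="if (a >
print "wrong:     $wrong\n";       # <body class="old">
```

In the first case the character class stops at the first `>` it sees, whether or not it is inside quotes. In the second, the regex has no notion of comments, so the leftmost match wins.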
Re^4: Regular Expression by Transient (Hermit) on Jun 28, 2005 at 19:42 UTC
Re^3: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:29 UTC
And how do you find the closing tag? CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re^4: Regular Expression by Transient (Hermit) on Jun 28, 2005 at 19:36 UTC
Re^5: Regular Expression by ikegami (Patriarch) on Jun 28, 2005 at 19:38 UTC
Re^5: Regular Expression by CountZero (Bishop) on Jun 28, 2005 at 19:54 UTC
Some notes below your chosen depth have not been shown here
Re^4: Regular Expression by ikegami (Patriarch) on Jun 28, 2005 at 19:36 UTC

