strip header from page fetched w/ LWP

geektron has asked for the wisdom of the Perl Monks concerning the following question:

this isn't a clean solution by any stretch of the imagination. in one of the apps i'm dealing with, the previous designer fetches a page from another server using LWP, dumps it into a variable, and passes that variable into a (badly done) HTML::Template::load_tmpl wrapper.

while i'd love to spend the time cleaning it up, i just need to fix a display problem right now. because the whole page is fetched, <head> section included, all the CSS information gets pulled in too ( along w/ title, etc ), and it's blowing up the display.

here's what i've been trying to do, without much success:


## dummy URL
my $out = get("http://www.webpage.com");

$out =~ s/\cM//g;

## doesn't seem to match 
$out =~ s#^<head>[\w|\s]+</head>#<!-- header removed -->#mio;
$out =~ s/<head>(.*?)</head>#<!-- header removed -->/im;

## these work ... but only take out one line
#$out =~ s#<html>#<!-- header removed -->#mi;
#$out =~ s#<title>(.*)</title>#<!-- header removed -->#mi;
#$out =~ s#<meta(.*)>#<!-- header removed -->#mi;

## also doesn't work
$out =~ s#<style type(.*?)</style>#<!-- header removed -->#im;

there may be a better, non-regex way ( and i'm open to suggestions ), but it seemed a brute force regex answer would be quick ....

Re: strip header from page fetched w/ LWP by Fletch (Bishop) on Feb 09, 2004 at 17:34 UTC
The surest way would be to use something like HTML::TreeBuilder or HTML::Parser to parse the contents and then extract everything from the <body> you're interested in.
Re: Re: strip header from page fetched w/ LWP by geektron (Curate) on Feb 09, 2004 at 17:42 UTC
well, since the markup is already there, and all i needed to do was exactly what the anon suggestion was .... your recommendation is a little over-the-top for now.
Re: Re: Re: strip header from page fetched w/ LWP by hardburn (Abbot) on Feb 09, 2004 at 17:53 UTC
No, using an HTML parser is the correct solution. It's impossible to properly parse HTML with pure regexen (it's possible with Perl's extended regexen, but it's still messy). It's hardly over the top; coding it with a parser would probably have taken as much time as it took for you to come up with broken regex solutions.
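To make the fragility concrete, here is a small sketch of two ways a simple `<head>` regex can miss. The sample pages, the `strip_head` helper, and the example.com URL are hypothetical and not from the thread; the substitution is the plain non-greedy pattern with the /s and /i flags.

```perl
use strict;
use warnings;

# Hypothetical helper: the simple non-greedy regex approach.
sub strip_head {
    my ($html) = @_;
    $html =~ s#<head>.*?</head>##is;
    return $html;
}

# Hypothetical page 1: the opening tag carries an attribute,
# so the literal "<head>" in the pattern never matches.
my $with_attr = '<html><head profile="http://example.com/p">'
              . '<title>t</title></head><body>x</body></html>';

# Hypothetical page 2: a comment inside the head contains "</head>",
# so the non-greedy match stops too early.
my $with_comment = '<html><head><!-- </head> --><title>t</title>'
                 . '</head><body>x</body></html>';

my $out_attr    = strip_head($with_attr);
my $out_comment = strip_head($with_comment);

print "attribute case: ",
    ($out_attr eq $with_attr ? "nothing stripped" : "stripped"), "\n";
print "comment case:   ",
    ($out_comment =~ /<title>/ ? "title left behind" : "title removed"), "\n";
```

Whether either case shows up in practice depends on the page being fetched; a parser handles both without special-casing.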
Re: strip header from page fetched w/ LWP by Anonymous Monk on Feb 09, 2004 at 17:37 UTC
Perhaps you didn't realize that . doesn't match \n unless you give the /s flag. `$out =~ s#<head>.*?</head>##s`
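A minimal sketch of the difference, assuming a made-up multi-line page in `$page` (the sample content and variable names are illustrative only). Note that /m only changes what ^ and $ match; it has no effect on whether . matches a newline.

```perl
use strict;
use warnings;

# Hypothetical sample page; any head section spanning several lines
# behaves the same way.
my $page = join "\n",
    '<html>',
    '<head>',
    '<title>t</title>',
    '<style type="text/css">p { color: red }</style>',
    '</head>',
    '<body>',
    'hello',
    '</body>',
    '</html>',
    '';

# Without /s, "." stops at each newline, so the pattern cannot
# stretch from <head> to </head> and nothing is replaced.
(my $no_s = $page) =~ s#<head>.*?</head>#<!-- header removed -->#im;

# With /s, "." also matches newlines, so the whole head is replaced.
(my $with_s = $page) =~ s#<head>.*?</head>#<!-- header removed -->#is;

print "without /s: ", ($no_s eq $page ? "unchanged" : "changed"), "\n";
print "with /s:    ",
    ($with_s =~ /<title>/ ? "head still there" : "head removed"), "\n";
```

The non-greedy `.*?` matters as well: with a greedy `.*` and /s, the match would run to the last `</head>` in the string rather than the first.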
Re: strip header from page fetched w/ LWP by Anonymous Monk on Feb 10, 2004 at 07:34 UTC
This is how easy it can be:

use strict;
use warnings;
use HTML::TreeBuilder;

my $t = HTML::TreeBuilder->new();
$t->parse(q~
<html>
<head>
<title>eltit</title>
</head>
<body>
rock me
</body>
</html>
~);
$t->eof;
die $t->find_by_tag_name('body')->as_HTML;
__END__
<body> rock me </body>