geektron has asked for the wisdom of the Perl Monks concerning the following question:

this isn't a clean solution by any stretch of the imagination. in one of the apps i'm dealing with, the previous designer fetches a page from another server using LWP, dumps it into a variable, and passes that variable into a (badly done) HTML::Template::load_tmpl wrapper.

while i'd love to spend the time cleaning it up, i just need to fix a display problem right now. because the headers are fetched, all the CSS information also gets fetched ( along w/ title, etc ), and it's blowing up the display.

here's what i've been trying to do, without much success:

    ## dummy URL
    my $out = get("http://www.webpage.com");

    $out =~ s/\cM//g;

    ## doesn't seem to match
    $out =~ s#^<head>[\w|\s]+</head>#<!-- header removed -->#mio;
    $out =~ s/<head>(.*?)</head>#<!-- header removed -->/im;

    ## these work ... but only take out one line
    #$out =~ s#<html>#<!-- header removed -->#mi;
    #$out =~ s#<title>(.*)</title>#<!-- header removed -->#mi;
    #$out =~ s#<meta(.*)>#<!-- header removed -->#mi;

    ## also doesn't work
    $out =~ s#<style type(.*?)</style>#<!-- header removed -->#im;

there may be a better, non-regex way ( and i'm open to suggestions ), but it seemed a brute force regex answer would be quick ....

Re: strip header from page fetched w/ LWP
by Fletch (Bishop) on Feb 09, 2004 at 17:34 UTC

    The surest way would be to use something like HTML::TreeBuilder or HTML::Parser to parse the contents and then extract everything from the <body> you're interested in.

      well, since the markup is already there, and all i needed to do was *exactly* what the anon suggestion was .... your recommendation is a little over-the-top for now.

        No, using an HTML parser is the correct solution. It's impossible to properly parse HTML with pure regexen (it's possible with Perl's extended regexen, but it's still messy). It's hardly over the top; coding it with a parser would probably have taken as much time as it took for you to come up with broken regex solutions.

        ----
        : () { :|:& };:

        Note: All code is untested, unless otherwise stated

Re: strip header from page fetched w/ LWP
by Anonymous Monk on Feb 09, 2004 at 17:37 UTC
    Perhaps you didn't realize that . doesn't match \n unless you give the /s flag.
    $out =~ s#<head>.*?</head>##s
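
The difference is easy to check on a small sample. What follows is only a sketch: the markup is a made-up stand-in for the page fetched with LWP, and the names $no_s and $with_s are hypothetical. It also shows why the /m attempts in the question did not help, since /m only changes what ^ and $ match and leaves . alone. The /i flag and the \b[^>]* part are optional additions, in case the real page uses <HEAD> or puts attributes on the tag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the page fetched with LWP.
my $out = join "\n",
    '<HTML>',
    '<head>',
    '<title>example</title>',
    '<style type="text/css">body { color: red }</style>',
    '</head>',
    '<body>',
    'rock me',
    '</body>',
    '</HTML>';

# Without /s, "." stops at each newline, so a <head> block that
# spans several lines never matches and the string is left alone.
(my $no_s = $out) =~ s#<head>.*?</head>#<!-- header removed -->#i;

# With /s, "." also matches "\n", so the whole block is replaced.
# \b[^>]* allows attributes on the opening tag.
(my $with_s = $out) =~ s#<head\b[^>]*>.*?</head>#<!-- header removed -->#is;

print $no_s eq $out ? "without /s: unchanged\n" : "without /s: changed\n";
print $with_s, "\n";
```

This is still the brute-force regex route: it would trip over things like a literal </head> inside a comment or script, which is where the parser suggestions elsewhere in the thread come in.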
Re: strip header from page fetched w/ LWP
by Anonymous Monk on Feb 10, 2004 at 07:34 UTC
    This is how easy it can be:
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $t = HTML::TreeBuilder->new();
    $t->parse(q~
    <html>
    <head>
    <title>eltit</title>
    </head>
    <body>
    rock me
    </body>
    </html>
    ~);
    $t->eof;
    die $t->find_by_tag_name('body')->as_HTML;
    __END__
    <body> rock me </body>