diamich has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Ones ;) I've sought help here once before and got the answer I needed, so I thought I'd try again on a different problem. I have two domains and want to dynamically pull content from one into the other. I managed to find a free script that allows me to do this, but it doesn't do any cleanup. I've read that extra <html>, <head>, and <body> tags aren't a problem for IE, but they can cause errors in Nav. Here's the code snippet I have that gets the content from my site:
$uatopasson = $ENV{"HTTP_USER_AGENT"};
$referertopasson = $ENV{"HTTP_REFERER"};
$ua = LWP::UserAgent->new;
$ua->agent($uatopasson);
$req = HTTP::Request->new (GET => "$file");
$req->header('referer' => $referertopasson);
$res = $ua->request($req);
$webpage = $res->content;
print "Content-type: text/html\n\n";
print $webpage;
What do I add to it before printing $webpage to remove the <html>, </html>, <body>, </body>, <head>, and </head> tags (as well as everything that falls between <head> </head>)? I'm also guessing that I would no longer need: print "Content-type: text/html\n\n"; Correct? Thanks in advance for any help you guys could give me.

Replies are listed 'Best First'.
Re: Removing selective tags and content between
by Ovid (Cardinal) on Oct 15, 2003 at 14:15 UTC

    Try HTML::TokeParser::Simple. It will handle most of your parsing needs.

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $page = do { local $/; <DATA> };   # slurp the sample page below
    my $parser = HTML::TokeParser::Simple->new(\$page);
    my $html = '';
    $parser->get_tag('body'); # skip to first body tag
    while (my $token = $parser->get_token) {
        last if $token->is_end_tag('body');
        $html .= $token->as_is;
    }
    print $html;
    __END__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
    <head>
    <title>test</title>
    </head>
    <body>
    <h1>headline</h1>
    <p>Content</p>
    </body>
    </html>

    Cheers,
    Ovid

    New address of my CGI Course.

      Thanks for the reply Ovid. Can I ask that you be a little more explicit? I'm TOTALLY Perl illiterate and since your solution doesn't use any of the variables I have, I can't even hazard a guess as to where I should put it or which variables I should change to match mine (or vice versa). Additionally, I should probably add that the two domains are on different hosts if that makes a difference.
        use HTML::TokeParser::Simple;
        $uatopasson = $ENV{"HTTP_USER_AGENT"};
        $referertopasson = $ENV{"HTTP_REFERER"};
        $ua = LWP::UserAgent->new;
        $ua->agent($uatopasson);
        $req = HTTP::Request->new (GET => "$file");
        $req->header('referer' => $referertopasson);
        $res = $ua->request($req);
        $webpage = $res->content;
        print "Content-type: text/html\n\n";
        $parser = HTML::TokeParser::Simple->new(\$webpage);
        $html = '';
        $parser->get_tag('body'); # skip to first body tag
        while (my $token = $parser->get_token) {
            last if $token->is_end_tag('body');
            $html .= $token->as_is;
        }
        print $html;

        I would also like to point out that I've very reluctantly left off "use strict" and warnings. Check the link to my CGI course (below) for more information.

        Cheers,
        Ovid

        New address of my CGI Course.

Re: Removing selective tags and content between
by ChrisR (Hermit) on Oct 15, 2003 at 14:56 UTC
    Not having anywhere near the experience of Ovid, I will offer a different solution. You could use a fairly simple regex to solve your problem, I think. In order to get rid of the HEAD tag and its contents, try:
    $webpage =~ s/<head>.+<\/head>//sgi;
    To get rid of the other tags you mentioned, try:
    $webpage =~ s/<html>|<\/html>|<body>|<\/body>//sgi;
    Keep in mind that this is a very narrow approach and will miss certain things like <body bgcolor="#FFF000">. A modification to the regex will fix this though:
    $webpage =~ s/<body.+>//sgi;
    There may be many other anomalies that you may have to take into consideration as well. One thing you can count on: you can't count on two people to format a line the same way.
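To illustrate what the modifiers on the first two substitutions do, here is a minimal sketch run against a made-up page. The sample HTML and the $cleaned variable are assumptions for the demonstration, not part of the original script, and the patterns only handle tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical sample page; tag case and line breaks vary on purpose.
my $webpage = <<'END_HTML';
<HTML>
<Head>
<title>test</title>
</Head>
<body>
<h1>headline</h1>
</body>
</HTML>
END_HTML

# /s lets "." match newlines, so <head>...</head> can span several lines.
# /i ignores case, so <Head> and <HTML> match as well as <head> and <html>.
# /g repeats the substitution for every match in the string.
(my $cleaned = $webpage) =~ s/<head>.+<\/head>//sgi;
$cleaned =~ s/<html>|<\/html>|<body>|<\/body>//sgi;

print $cleaned;
```

Working on a copy ($cleaned) keeps the fetched page intact, which makes it easier to compare before and after while experimenting.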
      Thanks Chris. I added the first two lines and tried it...worked great but left behind the line that started with <body background=.... So I added the third line you suggested and it took out all the content that was after that body tag as well....so the fetched page came up blank. I'm not sure if that was only because I tried this strictly with the fetching script alone without using an include statement in the page I wanted to place the content. Would the lack of a body tag cause all the content to disappear?
        As far as I know, the lack of a body tag should not keep the browser from rendering the page. Most browsers are pretty forgiving when it comes to that kind of stuff. A combination of the lines I showed above could be done as:
        $test =~ s/<head>.+<\/head>|<html>|<\/html>|<body.*?>|<\/body>//sgi;
        I can't see why this would have cleared the entire string but then again, I haven't seen the entire string you are trying to parse. Perhaps you could post a little more...
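One possible explanation for the blank page, offered as a sketch rather than a diagnosis of the actual input: with the /s modifier, a greedy ".+" in <body.+> runs to the last ">" in the whole string, while the non-greedy ".*?" stops at the first one. The sample page and the bg.gif filename below are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical page fragment with an attribute on the body tag.
my $page = qq{<body background="bg.gif">\n<h1>headline</h1>\n<p>Content</p>\n</body>};

# Greedy: ".+" with /s swallows everything up to the final ">",
# which here is the end of </body>, so nothing is left.
(my $greedy = $page) =~ s/<body.+>//sgi;

# Non-greedy: ".*?" stops at the first ">", so only the opening
# tag is removed; the second alternative removes </body>.
(my $lazy = $page) =~ s/<body.*?>|<\/body>//sgi;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Here the greedy version leaves an empty string, and the non-greedy version keeps the headline and paragraph. A character class such as <body[^>]*> is another common way to stop at the first ">".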