Removing selective tags and content between

diamich has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Ones ;) I've sought help here once before and got the answer I needed, so I thought I'd try again on a different problem. I have two domains and want to dynamically pull content from one into the other. I managed to find a free script that allows me to do this, but it doesn't do any cleanup. I've read that extra <html>, <head>, and <body> tags aren't a problem for IE, but they can cause errors in Nav. Here's the code snippet I have that gets the content from my site:

    $uatopasson = $ENV{"HTTP_USER_AGENT"};
    $referertopasson = $ENV{"HTTP_REFERER"};
    $ua = LWP::UserAgent->new;
    $ua->agent($uatopasson);
    $req = HTTP::Request->new (GET => "$file");  
    $req->header('referer' => $referertopasson);
    $res = $ua->request($req);
    $webpage = $res->content;
    print "Content-type: text/html\n\n";

    print $webpage;

What do I add to it before printing $webpage to remove the <html>, </html>, <body>, </body>, <head>, and </head> tags (as well as everything that falls between <head> </head>)? I'm also guessing that I would no longer need: print "Content-type: text/html\n\n"; Correct? Thanks in advance for any help you guys could give me.

Re: Removing selective tags and content between by Ovid (Cardinal) on Oct 15, 2003 at 14:15 UTC
Try HTML::TokeParser::Simple. It will handle most of your parsing needs.

    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $page   = do { local $/; <DATA> };
    my $parser = HTML::TokeParser::Simple->new(\$page);

    my $html = '';
    $parser->get_tag('body');    # skip to first body tag
    while (my $token = $parser->get_token) {
        last if $token->is_end_tag('body');
        $html .= $token->as_is;
    }
    print $html;

    __END__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
      <head>
        <title>test</title>
      </head>
      <body>
        <h1>headline</h1>
        <p>Content</p>
      </body>
    </html>

Cheers,
Ovid

New address of my CGI Course.
Re: Re: Removing selective tags and content between by diamich (Initiate) on Oct 15, 2003 at 14:33 UTC
Thanks for the reply Ovid. Can I ask that you be a little more explicit? I'm TOTALLY Perl illiterate and since your solution doesn't use any of the variables I have, I can't even hazard a guess as to where I should put it or which variables I should change to match mine (or vice versa). Additionally, I should probably add that the two domains are on different hosts if that makes a difference.
Re: Re: Re: Removing selective tags and content between by Ovid (Cardinal) on Oct 15, 2003 at 16:07 UTC
    use HTML::TokeParser::Simple;

    $uatopasson = $ENV{"HTTP_USER_AGENT"};
    $referertopasson = $ENV{"HTTP_REFERER"};
    $ua = LWP::UserAgent->new;
    $ua->agent($uatopasson);
    $req = HTTP::Request->new (GET => "$file");
    $req->header('referer' => $referertopasson);
    $res = $ua->request($req);
    $webpage = $res->content;
    print "Content-type: text/html\n\n";

    $parser = HTML::TokeParser::Simple->new(\$webpage);
    $html = '';
    $parser->get_tag('body');    # skip to first body tag
    while (my $token = $parser->get_token) {
        last if $token->is_end_tag('body');
        $html .= $token->as_is;
    }
    print $html;

I would also like to point out that I've very reluctantly left off "use strict" and warnings. Check the link to my CGI course (below) for more information.

Cheers,
Ovid

New address of my CGI Course.
Re: Removing selective tags and content between by ChrisR (Hermit) on Oct 15, 2003 at 14:56 UTC
Not having anywhere near the experience of Ovid, I will offer a different solution. You could use a fairly simple regex to solve your problem, I think. In order to get rid of the HEAD tag and its contents, try:

    $webpage =~ s/<head>.+<\/head>//sgi;

To get rid of the other tags you mentioned, try:

    $webpage =~ s/<html>|<\/html>|<body>|<\/body>//sgi;

Keep in mind that this is a very narrow approach and will miss certain things like <body bgcolor="#FFF000">. A modification to the regex will fix this though:

    $webpage =~ s/<body.+>//sgi;

There may be many other anomalies that you have to take into consideration as well. One thing you can count on: you can't count on two people to format a line the same way.
Re: Re: Removing selective tags and content between by diamich (Initiate) on Oct 15, 2003 at 15:17 UTC
Thanks Chris. I added the first two lines and tried it...worked great but left behind the line that started with <body background=.... So I added the third line you suggested and it took out all the content that was after that body tag as well....so the fetched page came up blank. I'm not sure if that was only because I tried this strictly with the fetching script alone, without using an include statement in the page where I wanted to place the content. Would the lack of a body tag cause all the content to disappear?
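The blank page described here is consistent with how a greedy `.+` behaves under the `/s` modifier, rather than with anything the browser does. The following is a minimal sketch, assuming a made-up one-line page in place of the real fetched content (the sample string and the variable names `$greedy` and `$bounded` are illustrative, not from the thread):

```perl
use strict;
use warnings;

# Hypothetical sample page, standing in for the fetched $webpage.
my $webpage = '<body background="bg.gif"><h1>headline</h1><p>Content</p>';

# Greedy: with /s, ".+" first runs to the end of the string and then backs
# off only as far as the LAST ">", so the rest of the page is consumed too.
(my $greedy = $webpage) =~ s/<body.+>//sgi;

# Negated character class: the match cannot run past the tag's own ">".
(my $bounded = $webpage) =~ s/<body[^>]*>//gi;

print "greedy:  [$greedy]\n";
print "bounded: [$bounded]\n";
```

In this sketch the greedy substitution leaves nothing behind, while the bounded one removes only the opening body tag. The `[^>]*` form does assume that no attribute value contains a literal `>`.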
Re: Re: Re: Removing selective tags and content between by ChrisR (Hermit) on Oct 15, 2003 at 15:29 UTC
As far as I know, the lack of a body tag should not keep the browser from rendering the page. Most browsers are pretty forgiving when it comes to that kind of stuff. A combination of the lines I showed above could be done as:

    $test =~ s/<head>.+<\/head>|<html>|<\/html>|<body.*?>|<\/body>//sgi;

I can't see why this would have cleared the entire string but then again, I haven't seen the entire string you are trying to parse. Perhaps you could post a little more...
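One way to package the combined cleanup is as a small subroutine that can be called on the fetched page before printing. This is only a sketch under stated assumptions: `strip_wrappers` is a hypothetical name, the sample page is invented, and the patterns are a variation on the ones above (non-greedy for the head block, a negated character class for tags that carry attributes), not the exact expression posted.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): removes the head element with
# its contents, then the html and body tags, leaving the page content.
sub strip_wrappers {
    my ($html) = @_;

    # Head block: ".*?" is non-greedy, so it stops at the first </head>.
    $html =~ s/<head\b[^>]*>.*?<\/head\s*>//sgi;

    # Opening and closing html/body tags, with or without attributes.
    $html =~ s/<\/?(?:html|body)\b[^>]*>//gi;

    return $html;
}

# Invented sample with upper-case tags and an attribute on the body tag.
my $page = '<HTML><HEAD><TITLE>test</TITLE></HEAD>'
         . '<BODY bgcolor="#FFF000"><h1>headline</h1><p>Content</p></BODY></HTML>';

my $clean = strip_wrappers($page);
print "$clean\n";
```

In the original script this would be called as something like `$webpage = strip_wrappers($webpage);` just before `print $webpage;`. Regex-based stripping remains brittle (comments, scripts, or a `>` inside an attribute value can confuse it), which is the usual argument for the parser-based approach suggested earlier in the thread.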
Re: Re: Re: Re: Removing selective tags and content between by diamich (Initiate) on Oct 15, 2003 at 16:10 UTC
Re: Re: Re: Re: Removing selective tags and content between by diamich (Initiate) on Oct 15, 2003 at 18:06 UTC