in reply to Removing selective tags and content between

Not having anywhere near the experience of Ovid, I will offer a different solution. You could use a fairly simple regex to solve your problem, I think. In order to get rid of the HEAD tag and its contents, try:
$webpage =~ s/<head>.+<\/head>//sgi;
To get rid of the other tags you mentioned, try:
$webpage =~ s/<html>|<\/html>|<body>|<\/body>//sgi;
Keep in mind that this is a very narrow approach and will miss certain things like <body bgcolor="#FFF000">. A modification to the regex will fix this though:
$webpage =~ s/<body.+>//sgi;
There may be many other anomalies that you may have to take into consideration as well. One thing you can count on: you can't count on two people to format a line the same way.
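One such anomaly is quantifier greediness. The following is only a minimal sketch on a made-up sample page (the string in $page and the variable names are assumptions for illustration, not the original poster's data). It compares the greedy <body.+> with a non-greedy form and with a negated character class, either of which stops at the end of the opening body tag:

```perl
use strict;
use warnings;

# Hypothetical sample page, invented for illustration only.
my $page = qq{<html><head><title>T</title></head>\n<body background="bg.gif">\n<p>Hello</p>\n</body></html>};

# Greedy: with /s, ".+" also crosses newlines, so the match runs from
# "<body" to the LAST ">" in the string and removes the content too.
(my $greedy = $page) =~ s/<body.+>//sgi;

# Non-greedy: ".*?" stops at the first ">" after "<body".
(my $lazy = $page) =~ s/<body.*?>//sgi;

# Negated character class: "[^>]*" can never run past the closing ">".
(my $class = $page) =~ s/<body[^>]*>//gi;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "class:  [$class]\n";
```

On this sample the greedy version leaves only the head section, while the other two keep the paragraph and the closing tags.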

Re: Re: Removing selective tags and content between
by diamich (Initiate) on Oct 15, 2003 at 15:17 UTC
    Thanks Chris. I added the first two lines and tried it...worked great but left behind the line that started with <body background=.... So I added the third line you suggested and it took out all the content that was after that body tag as well....so the fetched page came up blank. I'm not sure if that was only because I tried this strictly with the fetching script alone without using an include statement in the page I wanted to place the content. Would the lack of a body tag cause all the content to disappear?
      As far as I know, the lack of a body tag should not keep the browser from rendering the page. Most browsers are pretty forgiving when it comes to that kind of stuff. A combination of the lines I showed above could be done as:
      $test =~ s/<head>.+<\/head>|<html>|<\/html>|<body.*?>|<\/body>//sgi;
      I can't see why this would have cleared the entire string but then again, I haven't seen the entire string you are trying to parse. Perhaps you could post a little more...
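For anyone who wants to try the combined substitution in isolation, here is a rough sketch run against a small invented page (the contents of $test are an assumption, not the page from this thread). The pattern is the same alternation as above, only spread out with /x so each branch can carry a comment:

```perl
use strict;
use warnings;

# Hypothetical fetched page, invented for illustration only.
my $test = join "\n",
    '<HTML>',
    '<head><title>Example</title></head>',
    '<body background="bg.gif">',
    '<p>Content to keep</p>',
    '</body>',
    '</html>';

# Same alternation as the one-liner above; /x allows layout and comments.
$test =~ s{
    <head>.+</head>   # head section and everything inside it
  | <html>            # opening html tag (any case, thanks to /i)
  | </html>           # closing html tag
  | <body.*?>         # opening body tag, with or without attributes
  | </body>           # closing body tag
}{}sgix;

# Trim the blank lines left behind where the tags used to be.
$test =~ s/^\s+|\s+$//g;

print "$test\n";
```

Note that <head>.+</head> is still greedy, so this sketch assumes the page contains exactly one head section.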
        Chris...when you replied to my last post, I noticed that your statement had changed from <body.+> to <body.*?> When I used <body.*?> it made all the difference and everything worked fine :)) Thanks so much for the help!
        Thanks for all the help Chris. When you replied I noticed that you said <body.*?> when you had originally told me to use <body.+>....I put in <body.*?> and everything worked fine. Thank you again.