Stripping tags from a PerlMonks page.

dmtelf has asked for the wisdom of the Perl Monks concerning the following question:

To prepare a typical PerlMonks page for a Palm PDA, I'd like to remove all the ads and all the nodelets, such as the chatterbox.

I need to strip one line and a range of lines between two tags.

I'm not sure of the best way to do this, so I would appreciate advice from fellow Monks. My initial attempts were miserable, convoluted failures.

As an example, save the source HTML for this page onto HD as example.htm.

Given a filename which is a saved PerlMonks page (e.g. example.htm),

Strip out:

The entire <layer src="http://adfu.blockstackers.com/servfu.pl....." line. (This line appears directly after <BODY text="#000000" bgcolor="#FFFFFF" link="#000066" vlink="#333399">.)

From  to the last </CENTER> (before </BODY>).

Save the rest under a different filename.

Perhaps this should be a feature of the PerlMonks site, like those websites which offer a "Prepare this page for printing/emailing" button alongside every article. What do you think?

dmtelf

Replies are listed 'Best First'.
Re: Stripping tags from a PerlMonks page. by davorg (Chancellor) on Jul 25, 2000 at 15:30 UTC
I think your best bet for this task is HTML::Parser or one of its subclasses (HTML::TokeParser seems the most appropriate). Any solution that doesn't use a real parser will run into limitations eventually.
-- <http://www.dave.org.uk> European Perl Conference - Sept 22/24 2000, ICA, London <http://www.yapc.org/Europe/>
RE: Stripping tags from a PerlMonks page. by DrManhattan (Chaplain) on Jul 25, 2000 at 20:09 UTC
Here's an HTML::TokeParser solution. The output is kind of messy, but it works.
```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $filename = $ARGV[0] or die 'not enough arguments';
my $parser = HTML::TokeParser->new($filename)
    or die "Can't open $filename: $!\n";

while (my $token = $parser->get_token()) {
    my ($type, $tag) = ($token->[0], $token->[1]);

    # We don't want <layer> or <iframe> tags. Only start (S) and end (E)
    # tokens carry a tag name; for other types $tag holds text.
    next if ($type eq "S" || $type eq "E")
         && ($tag eq "layer" || $tag eq "iframe");

    # We can stop reading when we hit the nodelets section
    last if $type eq "C" && $tag eq "<!-- nodelets start here -->";

    # Print the token's text. All the token types except T
    # have their text as their last element. How annoying.
    if ($type eq "T") {
        print $tag;
    } else {
        print $token->[-1];
    }
}

# Add a closing </table>. Netscape won't display a table if the tags
# aren't balanced.
print "</table>\n";
# EOF
```
-Matt
Re: Stripping tags from a PerlMonks page. by fundflow (Chaplain) on Jul 25, 2000 at 17:30 UTC
Here is a quick-and-dirty solution. Going by your description, it might suffice:
```perl
my $skip = 0;
my ($startline, $endline) = (10, 20);   # placeholder line numbers to skip

while (<>) {
    next if /line that i want to skip/;
    $skip = 1 if /line marking beginning of skip block/;
    $skip = 0 if /line marking end of skip block/;   # the end line itself is printed
    print unless $skip || ($. >= $startline && $. <= $endline);
}
```
This lets you define a single line you want to skip, a block delimited by two special lines, and lines in a certain numeric range.
RE: Re: Stripping tags from a PerlMonks page. by davorg (Chancellor) on Jul 25, 2000 at 17:42 UTC
Parsing HTML using regular expressions is always a dangerous affair - especially if you have no control over the HTML content - but if you were going to use this approach you could simplify it by using the '..' operator.
```perl
while (<>) {
    next if /line to skip/;
    # Scalar '..' only compares against $. when its operands are literal
    # numbers, so with variables the comparison has to be spelled out.
    print unless (($. == $startline) .. ($. == $endline))
              or (/skip start line/ .. /skip end line/);
}
```
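For illustration, here is a minimal, self-contained sketch of the '..' (flip-flop) approach run against in-memory data. The sample lines, the marker patterns, and the helper name strip_lines are made up for this example; they are not taken from a real PerlMonks page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the marker strings are placeholders.
my @lines = (
    "keep 1\n",
    "<layer ad line>\n",
    "keep 2\n",
    "<!-- start -->\n",
    "nodelet junk\n",
    "</CENTER>\n",
    "keep 3\n",
);

# Return only the lines that survive the filter.
sub strip_lines {
    my @kept;
    for (@_) {
        # Drop the single unwanted line.
        next if /<layer/;
        # In scalar context '..' turns true on the line matching the left
        # pattern and stays true through the line matching the right
        # pattern, so both marker lines are dropped as well.
        next if /<!-- start -->/ .. m{</CENTER>};
        push @kept, $_;
    }
    return @kept;
}

print strip_lines(@lines);
```

The flip-flop keeps its state between iterations, so if the input ends before the closing marker is seen, it will still be "on" the next time the same code runs.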
Re: Stripping tags from a PerlMonks page. by turnstep (Parson) on Jul 25, 2000 at 22:48 UTC
Here's a quick and dirty solution I came up with. Ideally, you should talk to vroom about getting a "basic info, no fancy stuff" data feed, if there isn't one already (similar to the Slashdot-style info boxes). Anyway, here it is:
```perl
{
    # $PageToParse holds the saved page's filename without the .html extension
    open(PAGE, "$PageToParse.html") or die "Could not open: $!\n";
    local $/;    # slurp mode: the loop below reads the whole file in one pass
    while (<PAGE>) {
        m#<TITLE>(.*)</TITLE># and $title = $1;   # remember the title
        s#^.*?</TABLE>##s;             # drop everything up to the first </TABLE>
        s#<!-- nodelets start.*##s;    # drop the nodelets and everything after
        print $_;    ## Or to a new file, etc.
    }
    close(PAGE);
}
```
Short, ugly, and to the point. The title is the only thing of value I can see worth keeping before the end of the first TABLE tag. Jettison all that, jettison everything after the nodelets, and add back in stuff like the title, <BODY>, </BODY>, etc. as you desire.