in reply to Stripping tags from a PerlMonks page.
Here's an HTML::TokeParser solution. The output is kind of messy, but it works.
    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $filename = $ARGV[0] or die 'not enough arguments';
    my $parser = HTML::TokeParser->new($filename)
        or die "can't open $filename: $!";

    while (my $token = $parser->get_token()) {
        my ($type, $tag) = ($token->[0], $token->[1]);

        # We don't want <layer> or <iframe> tags. Only check start and end
        # tokens, since for other token types $tag holds text, not a tag name.
        next if ($type eq "S" || $type eq "E")
            && ($tag eq "layer" || $tag eq "iframe");

        # We can stop reading when we hit the nodelets section
        last if $type eq "C" && $tag eq "<!-- nodelets start here -->";

        # Print the token's text. All the token types except T
        # have their text as their last element. How annoying.
        if ($type eq "T") {
            print $tag;
        } else {
            print $token->[-1];
        }
    }

    # Add a closing </table>. Netscape won't display a table if the tags aren't
    # balanced.
    print "</table>\n";

    # EOF
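If the token layout is unclear, here is a rough sketch that dumps each token's type and original text from a small in-memory string. The sample HTML and the @lines array are made up for illustration; it relies on HTML::TokeParser->new accepting a reference to a scalar as the document to parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input, not taken from a real PerlMonks page.
my $html = '<p class="x">Hi</p><!-- note -->';

# A scalar reference tells the parser to read the string itself
# instead of treating the argument as a filename.
my $parser = HTML::TokeParser->new(\$html)
    or die "couldn't create parser";

my @lines;
while (my $token = $parser->get_token) {
    my $type = $token->[0];

    # T tokens keep their text at index 1; every other type keeps it last.
    my $text = $type eq 'T' ? $token->[1] : $token->[-1];
    push @lines, "$type: $text";
}

print "$_\n" for @lines;
```

Each line of output pairs a token type (S, T, E, C and so on) with the text that the stripping loop would print for it.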
-Matt