in reply to Getting Words out of HTML :)
You can tackle this two ways if all you want/need is quick and dirty:
I don't have any experience with HTML::Parse, so YMMV -- for more complicated HTML I'd go with HTML::Parse, since its authors have probably thought through the issues much more clearly than I have. But for simple HTML and a limited number of files you might be able to just script it in fewer than 20 lines.
Hope this helps.
Below is a version that works -- it's not terribly sophisticated, but it will catch *most* tags and not mangle your page too badly.
<CODE>#!/usr/local/perl5

$OUT = "/u/jreades/";

# Slurp mode: read each file in one go rather than line by line.
undef $/;

while (my $file = shift) {

    open(IN, "<" . $file)
        or die("Couldn't open file (" . $file . ") to read: " . $!);
    # Note: every input file is written to the same test.html.
    open(OUT, ">" . $OUT . "test.html")
        or die("Couldn't open file (" . $OUT . "test.html) to write: " . $!);

    print STDOUT "Reading: " . $file . "\n";
    print STDOUT "Writing: " . $OUT . "test.html\n";

    my $text = <IN>;

    # Strip one tag per pass, inspecting each tag as it is removed.
    while ($text =~ s/<([^>]+)>//) {
        my $tag = $1;

        # Keep the keywords and description from META tags.
        if ($tag =~ /NAME="(keywords|description)/i) {
            my $meta = $1;
            my ($content) = $tag =~ /CONTENT="([^"]*)"/i;
            $content = "" unless defined $content;
            print OUT $meta . ": " . $content . "\n\n";
        }
    }

    # Collapse runs of three or more newlines left behind by removed tags.
    $text =~ s/\n{3,}/\n/g;

    print OUT $text;

    close IN;
    close OUT;
}

exit 0;
</CODE>
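To show what "*most* tags" means in practice, here is a minimal sketch of the same substitution wrapped in a small sub. The name strip_tags is made up for this illustration and is not part of the script above. It handles ordinary markup, but a ">" inside an attribute value ends the match early and leaves debris behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): remove tags with the
# same <[^>]+> pattern, one tag per pass, until none are left.
sub strip_tags {
    my ($text) = @_;
    1 while $text =~ s/<[^>]+>//;
    return $text;
}

# Ordinary markup comes out clean.
print strip_tags('<p>Hello, <b>world</b>!</p>'), "\n";

# A ">" inside a quoted attribute cuts the match short, so the tail of
# the tag ( b"> ) survives in the output.
print strip_tags('<img alt="a > b">text'), "\n";
```

If your pages contain that kind of markup (or comments with ">" in them), that is the point where a real parser starts to pay off.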
RE: Re: Getting Words out of HTML :)
by jreades (Friar) on Aug 30, 2000 at 03:05 UTC
by jreades (Friar) on Aug 30, 2000 at 03:12 UTC