This brief script will try to extract the content from HTML fed to it. It isn't very smart, but
it does the work I want it to, and I think the concept is, at least, sound---even if what I had to
do to HTML::Element isn't very pretty!
It works okay on the slashdot main page (in fact the result looks a lot like the `minimal' slashdot
theme) and CNN story pages. I expect with more tweaking it would do the Right Thing to a great many
other weblogs.
The astute among you will see it doesn't use regexes to parse HTML; the even more astute will see
it does not always generate valid or well-formed HTML.
#!/usr/bin/perl -w
=head1 COPYRIGHT
Copyright 2001 Jason Henry Parker
This program is Free Software; you can redistribute it
and/or modify it under the same terms as Perl itself.
=cut
use strict;
use integer;
use Carp qw(carp croak);
use HTML::TreeBuilder;
my $tree = new HTML::TreeBuilder;
$tree->parse_file( shift||"index.html" );
$tree->add_scores(penalise => 1,
detach => 1);
my $everything = sub { 1 };
my @contents = $tree->look_down( $everything );
# TODO: extend this to use a Schwartzian transform instead
# of recalculating depth() * score() over and over and over.
#
# What this does is it takes all contents under $tree,
# and sorts by depth x score on all objects that are
# HTML::Elements (or subclasses).
@contents = sort {
    $b->depth() * $b->score()
    <=>
    $a->depth() * $a->score()
} grep {
    defined $_
    and ref $_
    and $_->isa("HTML::Element")
} @contents;
my $best = $contents[0];
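# Look for the `body' tag. This is probably way overkill.
# Since we bother to do this (ostensibly to hand back
# `valid' HTML), we should really go to the trouble of
# looking up from $best to find a table tag, and so on and
# so forth. Feh.
my $body = $tree->look_down( sub { my $x = shift;
    defined $x
    and ref $x
    and $x->isa("HTML::Element")
    and $x->tag eq 'body'; } );
# add_scores() may have detached the body itself (any node
# scoring <= 0 goes), so bail out cleanly rather than
# calling a method on undef.
croak "no body element left in the parse tree"
    unless defined $body;
$body->detach_content();
$body->push_content($best);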
# Use
# print $tree->as_Lisp_form(),"\n";
# for a view of the HTML where you can see the scores on
# each node.
#
# Fortunately, attributes beginning with "_" are stripped
# out.
print $tree->as_HTML(undef, " "),"\n";
exit 0;
### Everything below here is a subroutine.
# This is but a simple accessor method for our added
# attribute. The code would be much simpler if a perl 5.6
# lvalue sub could be used here instead.
sub HTML::Element::score {
    shift->attr('_score', @_);
}
# The real work happens here. This method recursively adds
# scores to the parse tree; if the `detach' argument is
# supplied, negatively scored nodes are removed, but these
# usually won't be generated unless the `penalise' option is
# also added.
sub HTML::Element::add_scores {
    my ($self, %args) = @_;
    my $sub;
    $sub = sub {
        my $self = shift;
        if (!defined $self) {
            carp "undefined value passed to add_scores()\n";
            return undef;
        }
        if (ref $self) {
            if ($self->isa("HTML::Element")) {
                $self->score(0);
                foreach ($self->content_list()) {
                    $self->score($self->score + $sub->($_));
                }
                if ($args{penalise}) {
                    $self->penalise();
                }
                if ($args{detach} and $self->score <= 0) {
                    my $t = $self->detach();
                    if (defined $t
                        and $t->isa("HTML::Element")) {
                        $t->normalize_content();
                    }
                }
                return $self->score;
            } else {
                carp "unknown ref type passed to add_scores()\n";
                return undef;
            }
        } else { # $self is not a reference
            return length $self;
        }
    };
    $sub->($self);
}
# Punish content-obscuring nodes, reward content-rich nodes.
# TODO: make this use HTML::Tagset.
# TODO: make this sub changeable at run-time to suit
# specific sites.
sub HTML::Element::penalise {
    my $self = shift;
    my $tag = $self->tag;
    my $score = $self->score;
    ## These elements are considered Just Plain Evil. They
    ## almost always obscure content.
    if ($tag eq 'script' or $tag eq 'span' or $tag eq 'form') {
        # $score = - $score can make a score positive again
        $score = -abs($score);
    } elsif ($tag eq 'p') {
        ## these elements often are, or contain, useful content.
        $score += 50;
    } elsif ($tag eq 'a') {
        ## a tags can be a pain; we could be seeing an
        ## off-site link to supporting documentation,
        ## or we could be seeing a mess of navigation links.
        $score = 1;
    }
    $self->score($score);
}
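The TODO in the script about a Schwartzian transform could look roughly like the following. This is only a minimal sketch: it assumes plain hashes with hypothetical `name`, `depth` and `score` keys in place of real HTML::Element objects, so that it runs without any CPAN modules. In the script itself the cached key would be `$_->depth() * $_->score()`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for HTML::Element nodes: each hash carries
# the two values the real script reads via depth() and score().
my @nodes = (
    { name => 'td',   depth => 5, score => 120 },
    { name => 'body', depth => 1, score => 900 },
    { name => 'p',    depth => 6, score => 210 },
    { name => 'a',    depth => 7, score => 1   },
);

# Schwartzian transform: decorate each node with its sort key once,
# sort on the cached key (highest first), then strip the key again.
my @ranked =
    map  { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    map  { [ $_->{depth} * $_->{score}, $_ ] }
    @nodes;

my $best = $ranked[0];
print "best: $best->{name}\n";
```

Each key is computed once per node rather than once per comparison, which is the whole point of the TODO; the resulting order is the same as with the plain sort.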
Re: HTML content extractor
by mirod (Canon) on Feb 10, 2001 at 22:31 UTC
|
#!/bin/perl
use HTML::Parser;
my $file= shift;
my $p = HTML::Parser->new(api_version => 3,
handlers => { text => [\@array, "text"] });
$p->parse_file( $file);
print $_->[0] foreach @array;
To keep the formatting I strongly suspect that
HTML::FormatText
will do a nice job too.
You can certainly reinvent the wheel, but please try not
to lure others into using your not-so-round attempt.
|
HTML::Parser->new(text_h => [sub{print @_}, "text"])->parse_file($file);
|
Did you run the program?
Look at what happens when both programs are given
the HTML
in this CNN story.
That is not a canned example---I simply looked at what was on CNN
right now, downloaded it, and asked my program to
search it for content. (Granted, it doesn't run perfectly on that
input---the first few paragraphs are elided---but your program does a
truly woeful job: to extract the content from what comes back would
require much more work than it does if the HTML syntax and structure
is there to help.)
Of course I looked at the HTML::Parser
module. I'm using HTML::TreeBuilder for any number of
good reasons.
Oh, and yes, HTML::FormatText would work, except it will
not render forms and tables, making it completely useless for dealing
with the vast majority of weblogs and news sites out there.
The point of the matter is my `not-so-round attempt' works better
than your approach ever will. I defy you to do better without doing
something at least as complex (and I don't consider what I've written
to be terribly complex).
|
My sincere apologies.
When I read the description of your code you provided I
assumed you had written yet-another-html-pseudo-parser.
Which you have not. That will teach me to answer posts
when I am tired (and too fast).
Once I started actually reading I found that your code
_is_ valuable. I also tried (of course!) to write
something similar but simpler, and haven't succeeded so
far (man, this CNN page is Hell!).
What I have managed though is to find a bug in XML::PYX
and one in XML::Twig, so I did not lose my time ;--)
Oh, and of course I upvoted the rest of your comments on the thread.
Sorry...
|
Re: HTML content extractor
by japhy (Canon) on Feb 11, 2001 at 12:24 UTC
|
If all you want to do is extract the text content from an HTML document, you can use YAPE::HTML like so:
use YAPE::HTML;
my $parser = YAPE::HTML->new("...");
while (my $chunk = $parser->next) {
print $chunk->text if $chunk->type eq 'text'
}
Oh, I'm going to be reworking the module to be able to read in chunks (so you can send it a filehandle, instead of a string).
japhy --
Perl and Regex Hacker
|
If all you want to do is extract the text content from an HTML document, you can use YAPE::HTML
like so:
Yes, and if extracting text was all I wanted to do, that's how I'd do
it.
The point of this CUFP is to extract
content---important text that would appear in a
rendered HTML page---as opposed to non-content,
such as the comments, the javascript, the unnecessary tags and other
fluff, which can't be reliably removed without some idea of the
document structure, which is readily available with a parse tree or
similar but not with a simple variation on HTML::Parser which can't
easily provide some context or easy
document manipulation.
Usually, a parse tree would be readily available through a DOM or XSLT,
or a DTD or something, but most HTML is not written well enough to
manipulate this way, so I'm using HTML::TreeBuilder to create the parse
tree for me, since it provides excellent support for parsing ambiguous
elements like a browser would.
Obviously I am not communicating my idea well, or this code is not as
good as I think it is, or something. To try to alleviate this problem,
I'll include the POD for the program here:
=head1 NAME
html-extract.pl - extract the content from a HTML page
=cut
=head1 SYNOPSIS
$ perl html-extract.pl foo.html >| newfoo.html
$ w3m -dump newfoo.html
=cut
=head1 DESCRIPTION
F<html-extract.pl> works by reading the file named
as its argument (or `index.html') and creating a
F<HTML::TreeBuilder> parse tree from it. Then, using some
added methods to F<HTML::Element>, the program searches the
tree for the `best' node (currently defined as the node with
the largest product of depth and score).
Nodes are scored very simplistically---a node's score is the
sum of all the scores of its contents; the score of a text
element is its length. Some nodes are penalised for being
obfuscatory, others are rewarded for being traditionally
associated with content. Any node whose score is zero or
negative is automatically detached from the parse tree.
After the best node is found, the head tag is preserved and
the body tag's contents are removed and replaced with the
aforementioned best node.
The parse tree is then printed as HTML to standard output.
=cut
=head1 CAVEATS
=over 4
=item o
The software is not well-tested; it worked on slashdot and a
CNN story page when the author tried it.
=item o
There is no way to customise the behaviour of the software
except to edit the source code.
=back
=cut
=head1 COPYRIGHT
Copyright 2001 Jason Henry Parker
This program is Free Software; you can redistribute it
and/or modify it under the same terms as Perl itself.
=cut
=head1 SEE ALSO
L<HTML::Element>; L<HTML::TreeBuilder>.
=cut
For anyone still interested in looking at the output of the program, I
recommend either the lynx or w3m text browsers, which will render as text
to a terminal or tty if passed the -dump argument.
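The scoring rule from the DESCRIPTION can also be sketched without the HTML modules. This is only an illustration under stated assumptions: nodes are plain hashes with hypothetical `tag` and `content` keys, text is a bare string, and `%adjust` and `score_node` are made-up names that are not part of the posted script. The per-tag table mirrors the rules in penalise() and hints at one way the behaviour might be made customisable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-tag adjustments, mirroring the rules in penalise().
my %adjust = (
    script => sub { -abs($_[0]) },
    span   => sub { -abs($_[0]) },
    form   => sub { -abs($_[0]) },
    p      => sub { $_[0] + 50 },
    a      => sub { 1 },
);

# Score a node: text scores its length, an element scores the sum of
# its children and is then adjusted by tag. Child elements scoring
# zero or less are dropped, but still count against the parent.
sub score_node {
    my ($node) = @_;
    return length $node unless ref $node;    # plain text

    my ($total, @kept) = (0);
    for my $child (@{ $node->{content} }) {
        my $s = score_node($child);
        $total += $s;
        push @kept, $child if !ref $child or $s > 0;
    }
    $node->{content} = \@kept;

    my $rule = $adjust{ $node->{tag} };
    $total = $rule->($total) if $rule;
    return $node->{score} = $total;
}

# A tiny made-up document: one script block and one paragraph.
my $doc = {
    tag     => 'body',
    content => [
        { tag => 'script', content => ['var x = 1;'] },
        { tag => 'p',      content => ['Hello, world'] },
    ],
};

my $score = score_node($doc);
print "body score: $score, children kept: ",
    scalar @{ $doc->{content} }, "\n";
```

Here the script block scores -10 and is dropped, the paragraph scores 12 + 50, and the body ends up at 52 with one child left.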
Re: HTML content extractor
by ichimunki (Priest) on Feb 11, 2001 at 07:48 UTC
|
I actually ran this. I'm not sure what it's supposed to do that makes it superior to the HTML::Parser quickies being kicked around, but it doesn't.
I ran it on a pod2html page of mine and all I got was
<html>
<head>
<title>My document title</title>
</head>
</html>
I also ran it on the HTML from this node and got a "can't call detach_content method on undefined" error.
|
I actually ran this. I'm not sure what it's supposed to do that makes
it superior to the HTML::Parser quickies being kicked around, but it
doesn't.
Thanks for running the software. For an idea of what it is supposed to
do, download the source of, say, a Wired or CNN news article, and run
that past the program. Those are two types of input documents that I
know work well.
Yes, unfortunately it is far from perfect. The intent is to use it on
busy weblog and news portal sites to automatically download and trim
out things like sidebars, boxes interrupting the flow of text, headers
and footers. So yes, I'm not surprised it didn't do too well on a POD
page---it assumes there's something to be found, but
this assumption doesn't work well on a document that is pretty much all
content and no distraction.
What's supposed to make it superior to HTML::Parser quickies (and I've
written a few of them in my time) is that it doesn't have to be told how
to interpret a given page. This may have to change in the future (the
range of HTML out there is pretty big!) but I'm confident the approach
is robust enough that with work it'll be a killer. If anyone
has a HTML::Parser quickie that works in the general case, I'd be very
pleased to see it.
The error you got is very unfortunate and wholly my fault for posting
something so premature.