comment on

This brief script will try to extract the content from HTML fed to it. It isn't very smart, but it does the work I want it to, and I think the concept is, at least, sound---even if what I had to do to HTML::Element isn't very pretty!

It works okay on the slashdot main page (in fact the result looks a lot like the `minimal' slashdot theme) and CNN story pages. I expect with more tweaking it would do the Right Thing to a great many other weblogs.

The astute among you will see it doesn't use regexes to parse HTML; the even more astute will see it does not always generate valid or well-formed HTML.

#!/usr/bin/perl -w

=head1 COPYRIGHT

Copyright 2001 Jason Henry Parker 

This program is Free Software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

use strict;
use integer;

use Carp qw(carp croak);
use HTML::TreeBuilder;

my $tree = new HTML::TreeBuilder;

$tree->parse_file( shift||"index.html" );

$tree->add_scores(penalise => 1,
                  detach => 1);

my $everything = sub { 1 };

my @contents = $tree->look_down( $everything );

# TODO: extend this to use a Schwartzian transform instead
# of recalculating depth() * score() over and over and over.
#
# What this does is it takes all contents under $tree,
# and sorts by depth x score on all objects that are
# HTML::Elements (or subclasses).
@contents = sort {
                $b->depth() * $b->score()
                    <=>
                $a->depth() * $a->score()
            } grep { 
                    defined $_
                and ref $_
                and $_->isa("HTML::Element")
            } @contents;

my $best = $contents[0];

# Look for the `body' tag. This is probably way overkill.
# Since we bother to do this (ostensibly to hand back
# `valid' HTML, we should really go to the trouble of
# looking up from $best to find a table tag, and so on and
# so forth. Feh.
my $body = $tree->look_down( sub { my $x = shift;
                                   defined $x
                               and ref $x
                               and $x->isa("HTML::Element")
                               and $x->tag eq 'body'; } );

$body->detach_content();

$body->push_content($best);

# Use
#   print $tree->as_Lisp_form(),"\n";
# for a view of the HTML where you can see the scores on
# each node.
#
# Fortunately, attributes beginning with "_" are stripped
# out.
print $tree->as_HTML(undef, "  "),"\n";

exit 0;

### Everything below here is a subroutine.

# This is but a simple accessor method for our added
# attribute. The code would be much simpler if a perl 5.6
# lvalue sub could be used here instead.
sub HTML::Element::score {
   shift->attr('_score', @_);
}

# The real work happens here. This method recursively adds
# scores to the parse tree; if the `detach' argument is
# supplied, negatively scored nodes are removed, but these
# usually won't be generated unless the `penalise' option is
# also added.
sub HTML::Element::add_scores {
    my ($self, %args) = @_;

    my $sub;

    $sub = sub {
        my $self = shift;

        if (!defined $self) {
            carp "undefined value passed to add_scores()\n";
            return undef;
        }

        if (ref $self) {
            if ($self->isa("HTML::Element")) {
                $self->score(0);

                foreach ($self->content_list()) {
                    $self->score($self->score + $sub->($_));
                }
                
                if ($args{penalise}) {
                    $self->penalise();
                }

                if ($args{detach} and $self->score <= 0) {
                    my $t = $self->detach();

                    if (defined $t
                        and $t->isa("HTML::Element")) {
                        $t->normalize_content();
                    }
                }

                return $self->score;
            } else {
                carp "unknown ref type passed to add_scores()\n";
                return undef;
            }
        } else { # $self is not a reference
            return length $self;
        }
    };

    $sub->($self);
}

# Punish content-obsuring nodes, reward content-rich nodes.
# TODO: make this use HTML::Tagset TODO: make this sub 
# changable at run-time to suit specific sites.
sub HTML::Element::penalise {
    my $self = shift;

    my $tag = $self->tag;
    my $score = $self->score;

    ## These elements are considered Just Plain Evil. They
    ## almost always obscure content.
    if ($tag eq 'script' or $tag eq 'span' or $tag eq 'form') {
        # $score = - $score can make a score positive again
        $score = -abs($score);
    } elsif ($tag eq 'p') {
        ## these elements often are, or contain, useful content.
        $score += 50;
    } elsif ($tag eq 'a') {
        ## a tags can be a pain; we could be seeing an
        ## off-site link to supporting documentation,
        ## or we could be seeing a mess of navigation links.
        $score = 1;
    }

    $self->score($score);
}
[download]

In reply to HTML content extractor by Nooks

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.