Re: HTML content extractor

Replies are listed 'Best First'.

Re: Re: HTML content extractor
by Nooks (Monk) on Feb 11, 2001 at 18:19 UTC

If all you want to do is extract the text content from an HTML document, you can use YAPE::HTML like so:

Yes, and if extracting text was all I wanted to do, that's how I'd do it.

The point of this CUFP is to extract content, meaning the important text that would appear in a rendered HTML page, as opposed to non-content such as comments, JavaScript, unnecessary tags and other fluff. That fluff can't be reliably removed without some idea of the document structure. A parse tree or similar makes that structure readily available; a simple variation on HTML::Parser does not, because it can't easily provide context or document manipulation.

Usually a parse tree would be readily available through a DOM, an XSLT processor or similar XML tooling, but most HTML is not written well enough to be handled that way. So I'm using HTML::TreeBuilder to create the parse tree for me, since it does an excellent job of parsing ambiguous markup the way a browser would.
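
As a rough illustration of that point (this is not code from html-extract.pl, and the sample markup is invented for the example), the sketch below feeds HTML::TreeBuilder a page with unclosed paragraph tags and an inline script, then removes the script by walking the tree rather than by pattern matching:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately sloppy HTML: neither <p> is closed, and a script sits in the body.
my $html = <<'END_HTML';
<html><head><title>Demo</title></head>
<body>
<script type="text/javascript">var hits = 1;</script>
<p>First paragraph
<p>Second paragraph
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# With a tree, "remove the JavaScript" is a structural operation, not a regex.
$_->delete for $tree->find_by_tag_name('script', 'style');

# TreeBuilder closed the first <p> implicitly, so both paragraphs are elements.
my @paragraphs = $tree->find_by_tag_name('p');
printf "%d paragraph elements\n", scalar @paragraphs;

# An empty optional-end-tags hash makes as_HTML emit every closing tag.
my $cleaned = $tree->as_HTML(undef, '  ', {});
print $cleaned, "\n";

$tree->delete;
```

The same removal done with a plain HTML::Parser handler would need its own bookkeeping to know when it is inside a script element, which is the kind of context a tree gives for free.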

Obviously I am not communicating my idea well, or this code is not as good as I think it is, or something. To try to alleviate this problem, I'll include the POD for the program here:

=head1 NAME

html-extract.pl - extract the content from an HTML page

=cut

=head1 SYNOPSIS

    $ perl html-extract.pl foo.html >| newfoo.html
    $ w3m -dump newfoo.html

=cut

=head1 DESCRIPTION

F<html-extract.pl> works by reading the file named
as its argument (or F<index.html> if none is given) and
creating an L<HTML::TreeBuilder> parse tree from it. Then,
using some methods added to L<HTML::Element>, the program
searches the tree for the `best' node (currently defined as
the deepest, highest-scoring node).

Nodes are scored very simplistically---a node's score is the
sum of all the scores of its contents; the score of a text
element is its length. Some nodes are penalised for being
obfuscatory, others are rewarded for being traditionally
associated with content. Any node that scores negatively is
automatically deleted from the parse tree.

After finding the best node, the head tag is preserved and
the body tag's contents are removed and replaced with the
aforementioned best node.

The parse tree is then printed as HTML to standard output.

=cut

=head1 CAVEATS

=over 4

=item o

The software is not well-tested; it worked on Slashdot and a
CNN story page when the author tried it.

=item o

There is no way to customise the behaviour of the software
except to edit the source code.

=back

=cut

=head1 COPYRIGHT

Copyright 2001 Jason Henry Parker

This program is Free Software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

=head1 SEE ALSO

L<HTML::Element>; L<HTML::TreeBuilder>.

=cut
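
To make the DESCRIPTION concrete, here is a minimal sketch of the scoring and selection steps. It is one reading of the POD, not the actual program: the `%adjust` weights, the `score` and `extract_content` subs and the sample page are all hypothetical, plain subs stand in for the methods the real script adds to HTML::Element, and "deepest, highest-scoring" is taken to mean "highest score, with ties going to the deeper node". It also assumes a deleted node does not count against its parent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Hypothetical adjustments: the POD names no actual tags or values.
my %adjust = (
    p    => 25,      # rewarded: traditionally associated with content
    form => -200,    # penalised: usually search boxes and other fluff
);

my %score_of;        # refaddr of an element => its score

# A node's score is the sum of its contents' scores; text scores its length.
# Element children that score below zero are deleted and (assumption) ignored.
sub score {
    my ($node) = @_;
    return length $node unless ref $node;        # plain text segment
    my $score = $adjust{ $node->tag } || 0;
    for my $child ($node->content_list) {
        my $s = score($child);
        if (ref $child and $s < 0) { $child->delete; next }
        $score += $s;
    }
    return $score_of{ refaddr $node } = $score;
}

sub extract_content {
    my ($html) = @_;
    %score_of = ();
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    my $body = $tree->find_by_tag_name('body');
    score($body);

    # Highest score first; among equal scores, prefer the deeper node.
    my ($best) = sort {
        $score_of{ refaddr $b } <=> $score_of{ refaddr $a }
            or $b->depth <=> $a->depth
    } $body->descendants;

    if ($best) {
        # Keep <head>, replace everything in <body> with the best node.
        $best->detach;
        $body->delete_content;
        $body->push_content($best);
    }
    my $out = $tree->as_HTML(undef, '  ', {});
    $tree->delete;
    return $out;
}

my $sample = <<'END_HTML';
<html><head><title>Demo page</title></head><body>
<form action="/search"><input type="text" name="q"></form>
<div id="wrapper"><div id="main">
<p>The first paragraph of the article, long enough to score well.</p>
<p>The second paragraph of the article, also fairly long.</p>
</div></div>
</body></html>
END_HTML

print extract_content($sample), "\n";
```

In this sample the form scores below zero and is deleted, the two div elements tie for the top score, and the tie-break picks the inner one, so the wrapper is dropped. A real page would need a much larger `%adjust` table, and where to draw the line between reward and penalty is exactly the part the POD leaves to the source code.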

For anyone still interested in looking at the output of the program, I recommend either the lynx or w3m text browsers, both of which print a page rendered as plain text to standard output when passed the -dump argument.
