=head1 NAME

html-extract.pl - extract the content from an HTML page

=cut

=head1 SYNOPSIS

    $ perl html-extract.pl foo.html >| newfoo.html
    $ w3m -dump newfoo.html

=cut

=head1 DESCRIPTION

F<html-extract.pl> works by reading the file named as its argument (or
F<index.html> if none is given) and creating an HTML parse tree from it.
Then, using some methods added to the parse tree's node class, the
program searches the tree for the `best' node (currently defined as the
deepest, highest-scoring node).

Nodes are scored very simply: a node's score is the sum of the scores of
its contents, and the score of a text element is its length. Some nodes
are penalised for being obfuscatory; others are rewarded for being
traditionally associated with content. Any node that scores negatively
is automatically deleted from the parse tree.

Once the best node has been found, the head tag is preserved and the
body tag's contents are replaced with that best node. The parse tree is
then printed as HTML to standard output.

=cut

=head1 CAVEATS

=over 4

=item o

The software is not well-tested; it worked on slashdot and a CNN story
page when the author tried it.

=item o

There is no way to customise the behaviour of the software except to
edit the source code.

=back

=cut

=head1 COPYRIGHT

Copyright 2001 Jason Henry Parker

This program is Free Software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

=head1 SEE ALSO

L; L.

=cut
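
=head1 EXAMPLE

The scoring scheme described under DESCRIPTION can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
program's actual code: the tree is a hand-built structure of hashes rather
than a real HTML parse tree, the names C<%ADJUST>, C<score_and_prune> and
C<best_node> are hypothetical, the per-tag penalties and rewards are
invented values, and `deepest, highest-scoring' is read here as "highest
score, with the deeper node winning ties".

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical per-tag adjustments; the real program's values are not documented.
my %ADJUST = (
    script => -50,    # assumed penalty for an "obfuscatory" tag
    p      =>  10,    # assumed reward for a "content" tag
);

# A node is { tag => ..., kids => [...] }; a text element is a plain string.
# Returns the node's score and drops any child that scores negatively.
sub score_and_prune {
    my ($node) = @_;
    return length $node unless ref $node;    # text scores its length

    my $score = $ADJUST{ $node->{tag} } // 0;
    my @kept;
    for my $kid (@{ $node->{kids} }) {
        my $s = score_and_prune($kid);
        next if $s < 0;                       # negative nodes are deleted
        $score += $s;
        push @kept, $kid;
    }
    $node->{kids}  = \@kept;
    $node->{score} = $score;
    return $score;
}

# Returns (node, score, depth) for the highest-scoring node in an already
# scored tree; on equal scores the deeper node wins (an assumed reading).
sub best_node {
    my ($node, $depth) = @_;
    $depth //= 0;
    my @best = ($node, $node->{score}, $depth);
    for my $kid (grep { ref } @{ $node->{kids} }) {
        my @cand = best_node($kid, $depth + 1);
        @best = @cand
            if $cand[1] > $best[1]
            || ($cand[1] == $best[1] && $cand[2] > $best[2]);
    }
    return @best;
}

my $tree = {
    tag  => 'body',
    kids => [
        { tag => 'script', kids => ['var x = 1;'] },
        { tag => 'div', kids => [
            { tag => 'p', kids => ['First paragraph of the story.'] },
            { tag => 'p', kids => ['Second paragraph.'] },
        ] },
    ],
};

score_and_prune($tree);
my ($best, $score, $depth) = best_node($tree);
say "best node: <$best->{tag}> (score $score, depth $depth)";
```

In this sketch the script node scores negatively and is pruned, so the body
and the div end up with the same score and the deeper div is chosen. Scoring
and pruning happen in one pass and the search in a second, so that a node
inside a deleted subtree can never be selected.

=cut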