in reply to How would you extract *content* from websites?

HTML::Strip, for example?

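use HTML::Strip;

# $raw_html is assumed to already hold the page source
my $hs         = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;    # reset the parser before reusing $hs on another document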

The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon

Re^2: How would you extract *content* from websites?
by Ovid (Cardinal) on Jun 17, 2005 at 18:31 UTC

    The problem is that this is going to leave a lot of "non-content" data such as menu link names, possible advertising text, etc. While it's a very poor guide, HTML can serve as "metadata" that allows you to navigate to the actual content. Strip the markup before you have located the content and the spider can no longer make intelligent decisions.
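
    To make that concrete, here is a rough sketch (not tested against any real site) of using the markup as a guide before throwing it away. It uses HTML::TreeBuilder rather than HTML::Strip, and the ids "nav", "ads" and "content" are purely hypothetical: real pages will need their own heuristics.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Hypothetical sample page; the ids below are assumptions for illustration.
    my $raw_html = q{<html><body>
      <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
      <div id="content"><h1>Title</h1><p>The actual article text.</p></div>
      <div id="ads">Buy now!</div>
    </body></html>};

    sub extract_content {
        my ($html) = @_;
        my $tree = HTML::TreeBuilder->new_from_content($html);

        # Use the markup as metadata: drop blocks that look like boilerplate.
        for my $id (qw(nav ads)) {
            $_->delete for $tree->look_down( id => $id );
        }

        # Prefer a dedicated content container if there is one,
        # otherwise fall back to whatever is left of the tree.
        my $node = $tree->look_down( id => 'content' ) || $tree;
        my $text = $node->as_text;

        $tree->delete;    # free the parse tree explicitly
        return $text;
    }

    my $text = extract_content($raw_html);
    print "$text\n";
    ```

    Only after the menu and advertising blocks are gone does the text get flattened, which is the step a plain tag stripper cannot do.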

    Cheers,
    Ovid
