Depending on exactly which text you want (for example, whether to include or exclude what is in the <head>), you might also get a solution working by running Mozilla's Readability library under one of the JavaScript engine bindings (JavaScript::QuickJS, JavaScript::Duktape), or by porting that library to Perl.
Depending on the content, you can often find an RSS feed instead.
I distinctly remember reading a paper about HTML content extraction that did some calculation on the tree structure of the page and picked something like the element with the highest number of direct children of (I think) type p or div, but I can't find it anymore. That should be fairly simple to implement using XPath queries.
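To make that heuristic concrete, here is a rough sketch of how I understand it, not the paper's actual algorithm (which I can't cite). It is in Python only because the standard library is enough there; the same idea should carry over to any XPath-capable Perl module. Assumptions: the input is well-formed (XHTML-like) markup, since xml.etree.ElementTree does not repair broken HTML, and the sample page and the helper names (block_child_count, main_content_element, extract_text) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real pages would need an HTML-tolerant parser first.
SAMPLE = """<html>
  <head><title>Example</title></head>
  <body>
    <div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
    <div id="content">
      <h1>Headline</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <p>Third paragraph.</p>
      <p>Fourth paragraph.</p>
    </div>
    <div id="footer"><p>Copyright notice</p></div>
  </body>
</html>"""


def block_child_count(element):
    # "p" and "div" are ElementTree path expressions that match
    # direct children only, not deeper descendants.
    return len(element.findall("p")) + len(element.findall("div"))


def main_content_element(root):
    # Visit every element in document order and keep the one with the
    # most direct p/div children. On a tie, max() keeps the first one seen.
    return max(root.iter(), key=block_child_count)


def extract_text(element):
    # Collect all text below the element and collapse runs of whitespace.
    return " ".join(" ".join(element.itertext()).split())


root = ET.fromstring(SAMPLE)
best = main_content_element(root)
print(best.get("id"), "->", extract_text(best))
```

Note the obvious weakness: a <body> that holds many layout <div>s directly can outscore the real article container, so some tie-breaking or weighting (text length, for example) is probably needed in practice.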
In reply to Re: Module to extract text from HTML
by Corion
in thread Module to extract text from HTML
by Bod