in reply to Module to extract text from HTML
Depending on exactly which text you want (for example, whether to include or exclude stuff in the <head>), you might also get a solution working by running Mozilla's Readability library through one of the JavaScript engine bindings (JavaScript::QuickJS, JavaScript::Duktape), or by porting that library to Perl.
Depending on the content, you can often find an RSS feed instead.
I distinctly remember reading a paper about HTML content extraction that did some calculation on the tree structure of the page and picked something like the element with the highest number of direct children of (I think) type p or div, but I can't find it anymore. That should be fairly simple to implement using XPath queries.
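A rough sketch of that idea, going only by my recollection of the heuristic and not by the paper itself: pick the element with the most direct p or div children and treat it as the main content. It assumes XML::LibXML is installed; the sub name best_container and the sample page are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return the element under <body> (or <body> itself) that has the most
# direct <p>/<div> children, or undef if no element has any.
# Ties go to the first candidate in document order.
sub best_container {
    my ($html) = @_;

    # libxml2's HTML parser; recover from the usual real-world markup errors.
    my $doc = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,
        suppress_errors => 1,
    );

    my ($best, $best_count) = (undef, 0);

    # Candidates: every element that has at least one direct p or div child.
    for my $node ($doc->findnodes('//body/descendant-or-self::*[p or div]')) {
        # Count only direct children, not deeper descendants.
        my $count = $node->findvalue('count(p | div)');
        ($best, $best_count) = ($node, $count) if $count > $best_count;
    }
    return $best;
}

# Hypothetical sample page: a small nav, a main block, and a footer.
my $html = <<'HTML';
<html><head><title>Demo</title></head><body>
<div id="nav"><p>Home</p></div>
<div id="main"><p>First paragraph.</p><p>Second paragraph.</p><p>Third paragraph.</p><p>Fourth paragraph.</p></div>
<div id="footer"><p>Contact</p></div>
</body></html>
HTML

my $main = best_container($html);
if ($main) {
    print "Chosen element: ", $main->nodeName, "#", ($main->getAttribute('id') // ''), "\n";
    print $main->textContent, "\n";
}
```

On a real page a plain child count will probably need tuning, for example weighting by text length or skipping navigation lists, but it shows how little code the XPath part takes.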