Bod has asked for the wisdom of the Perl Monks concerning the following question:
I've been searching unsuccessfully for a module to extract just the text from an HTML webpage...
Any suggestions?
Ideally, I want to feed in a URL and return the page's text as plain text - no formatting, tags, etc.
Even most of the text would suffice.
I'm currently using HTML::TreeBuilder and extracting just the p tags, which is not quite good enough:

    use HTTP::Tiny;
    use HTML::TreeBuilder;

    # $url is set earlier in the script
    my $http = HTTP::Tiny->new;
    my $resp = $http->get($url);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($resp->{'content'});    # parse and signal end of input

    # Collect every <p> element in the document
    my @paragraph = $tree->look_down('_tag', 'p');

    print "Content-type: text/plain\n\n";
    foreach my $line (@paragraph) {
        print $line->as_trimmed_text . "\n";
    }
I thought I'd found a solution with HTML::Extract. But when the sample code in its documentation wouldn't compile, I knew I was heading down a dead end!
Do you know of a module to extract just the text?
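For illustration, here is a minimal sketch of one way to get past the p-tags-only limitation while staying with HTML::TreeBuilder: drop script and style elements, put a newline after each block-level element, and take the text of the whole body. The helper name html_to_text and the list of block-level tags are assumptions made for this example, not part of any module, and the HTML is an inline sample rather than a fetched page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Tags treated as block-level for line-breaking purposes.
# This list is an assumption for the sketch; extend it as needed.
my %BLOCK = map { $_ => 1 }
    qw(p div h1 h2 h3 h4 h5 h6 li tr br blockquote pre);

# Hypothetical helper: returns the reader-visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);

    # Remove elements whose content is not reader-visible text.
    $_->delete
        for $tree->look_down( sub { $_[0]->tag =~ /^(?:script|style)$/ } );

    # Work on <body> only, so the <title> text is left out.
    my $body = $tree->find_by_tag_name('body') || $tree;

    # Insert a newline after each block-level element so that adjacent
    # paragraphs and headings do not run together in the output.
    $_->postinsert("\n")
        for $body->look_down( sub { $BLOCK{ $_[0]->tag } } );

    my $text = $body->as_text;
    $tree->delete;    # free the tree

    # Tidy whitespace: collapse runs of spaces and blank lines, trim ends.
    $text =~ s/[ \t]+/ /g;
    $text =~ s/ *\n\s*/\n/g;
    $text =~ s/^\s+|\s+$//g;
    return "$text\n";
}

# Inline sample document standing in for a fetched page.
my $sample = <<'HTML';
<html><head><title>Demo</title><style>p { color: red }</style></head>
<body><h1>Heading</h1><p>First <b>bold</b> paragraph.</p>
<script>var x = 1;</script><div>Text outside a p tag.</div></body></html>
HTML

print html_to_text($sample);
```

To use it on a live page, the string passed to html_to_text would be the 'content' field of the HTTP::Tiny response. Pages that build their content with JavaScript will still come out mostly empty with this approach, since nothing here executes scripts.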
Replies are listed 'Best First'.

Re: Module to extract text from HTML
by marto (Cardinal) on Feb 27, 2024 at 11:29 UTC
by Bod (Parson) on Feb 27, 2024 at 19:37 UTC

Re: Module to extract text from HTML
by hippo (Archbishop) on Feb 27, 2024 at 11:30 UTC
by Bod (Parson) on Feb 27, 2024 at 19:35 UTC

Re: Module to extract text from HTML
by Corion (Patriarch) on Feb 27, 2024 at 11:52 UTC
by bliako (Abbot) on Feb 27, 2024 at 21:06 UTC
by parv (Parson) on Feb 27, 2024 at 21:22 UTC
by bliako (Abbot) on Feb 27, 2024 at 22:04 UTC
by bliako (Abbot) on Feb 27, 2024 at 22:36 UTC
by afoken (Chancellor) on Feb 28, 2024 at 19:41 UTC
by bliako (Abbot) on Feb 29, 2024 at 17:35 UTC

Re: Module to extract text from HTML
by bliako (Abbot) on Feb 27, 2024 at 22:09 UTC

Re: Module to extract text from HTML
by parv (Parson) on Feb 27, 2024 at 11:34 UTC
by marto (Cardinal) on Feb 27, 2024 at 11:46 UTC
by parv (Parson) on Feb 27, 2024 at 18:47 UTC
by marto (Cardinal) on Feb 28, 2024 at 17:06 UTC
by Bod (Parson) on Feb 27, 2024 at 19:45 UTC
by bliako (Abbot) on Feb 28, 2024 at 14:15 UTC
by Bod (Parson) on Mar 01, 2024 at 15:47 UTC

Re: Module to extract text from HTML
by kikuchiyo (Hermit) on Feb 29, 2024 at 13:31 UTC
by bliako (Abbot) on Feb 29, 2024 at 15:17 UTC
by marto (Cardinal) on Feb 29, 2024 at 15:40 UTC
by eyepopslikeamosquito (Archbishop) on Feb 29, 2024 at 22:40 UTC
by bliako (Abbot) on Feb 29, 2024 at 17:32 UTC
by marto (Cardinal) on Feb 29, 2024 at 17:34 UTC
by Danny (Chaplain) on Feb 29, 2024 at 15:45 UTC
by marto (Cardinal) on Feb 29, 2024 at 15:51 UTC

Re: Module to extract text from HTML
by perlfan (Parson) on Feb 29, 2024 at 05:05 UTC