I've been searching unsuccessfully for a module to extract just the text from an HTML webpage...
Any suggestions?
Ideally, I want to feed in a URL and return the page's text as plain text - no formatting, tags, etc.
Even most of the text would suffice.
I'm currently using HTML::TreeBuilder and just extracting the p tags, which is not quite good enough because any text outside a p tag is lost:
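    use HTTP::Tiny;
    use HTML::TreeBuilder;

    # $url is assumed to be set earlier in the script
    my $http = HTTP::Tiny->new;
    my $resp = $http->get($url);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($resp->{'content'});
    $tree->eof;    # signal end of input so the tree is complete

    # Collect every <p> element and print its text
    my @paragraph = $tree->look_down('_tag', 'p');
    print "Content-type: text/plain\n\n";
    foreach my $line (@paragraph) {
        print $line->as_trimmed_text . "\n";
    }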
I thought I'd found a solution with HTML::Extract. But when the sample code in the documentation wouldn't compile, I knew I was heading down a dead end!
Do you know of a module to extract just the text?
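For reference, here is a rough sketch of the kind of thing I'm after, staying with HTML::TreeBuilder: walk the whole tree rather than just the p tags, and skip the head, script and style elements. The names html_to_text and collect_text are my own hypothetical helpers, not part of any module, and the choice of which tags to skip is an assumption.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: recursively gather text nodes under $node into @$out.
sub collect_text {
    my ($node, $out) = @_;
    for my $child ($node->content_list) {
        if (ref $child) {
            # Element node: skip containers whose text is not page content
            next if $child->tag =~ /^(?:head|script|style)$/;
            collect_text($child, $out);
        }
        else {
            # Plain string: a text node
            push @$out, $child;
        }
    }
    return;
}

# Hypothetical helper: return the text of an HTML string as one line.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);

    my @chunks;
    collect_text($tree, \@chunks);
    $tree->delete;    # free the tree once we are done with it

    # Join with spaces so adjacent blocks do not run together,
    # then collapse runs of whitespace and trim the ends.
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print html_to_text(
    '<html><head><title>T</title></head>'
  . '<body><h1>Hello</h1><p>World <b>wide</b></p>'
  . '<script>var x;</script></body></html>'
), "\n";
```

To use it with the fetch above, pass $resp->{'content'} to html_to_text. One known weakness: joining every text node with a space also splits a word that has inline markup in the middle of it, so this is "most of the text" rather than a faithful rendering.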