in reply to Re: Module to extract text from HTML
in thread Module to extract text from HTML
And here is the long-winded road: use the mech to save the page as a PDF, then run pdftotext (Linux command line) to extract the text (all mixed up, and good luck):
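...
my $pdf_data = $mech->content_as_pdf( format => 'A0' );
# write the raw PDF bytes to disk
open(my $fh, '>:raw', 'the.pdf') or die "Cannot open the.pdf: $!";
print $fh $pdf_data;
close $fh or die "Cannot close the.pdf: $!";
# with no output file given, pdftotext writes the text to the.txt
system('pdftotext', 'the.pdf') == 0 or die "pdftotext failed: $?";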
Note that 'A0' paper size ...
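If you want the text back in Perl rather than in the.txt, here is a minimal sketch. The sub names pdftotext_command and pdf_to_text are made up for illustration. It assumes pdftotext is in your PATH; '-layout' asks pdftotext to keep the physical layout, which may or may not help with the mixed-up output, and '-' as the output file sends the text to stdout.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext (hypothetical helper).
# '-layout' keeps the physical layout; '-' means "write to stdout".
sub pdftotext_command {
    my ($pdf_file, %opt) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $opt{layout};
    push @cmd, $pdf_file, '-';
    return @cmd;
}

# Run pdftotext and return whatever it prints (hypothetical helper).
# The list form of open avoids passing the file name through a shell.
sub pdf_to_text {
    my ($pdf_file, %opt) = @_;
    my @cmd = pdftotext_command($pdf_file, %opt);
    open(my $pipe, '-|', @cmd) or die "Cannot run @cmd: $!";
    my $text = do { local $/; <$pipe> };   # slurp all output
    close $pipe or die "pdftotext failed (exit status $?)";
    return $text;
}

# Usage, once the.pdf has been written:
#   my $text = pdf_to_text('the.pdf', layout => 1);
```

The returned string holds the raw bytes pdftotext printed, so decode it yourself if you need Perl characters.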
Re^3: Module to extract text from HTML
by afoken (Chancellor) on Feb 28, 2024 at 19:41 UTC
by bliako (Abbot) on Feb 29, 2024 at 17:35 UTC