Re: Module to extract text from HTML

Depending on exactly which text you want (for example, whether to include or exclude the contents of <head>), you might also get a solution working by running Mozilla's Readability library through one of the JavaScript engine bindings (JavaScript::QuickJS, JavaScript::Duktape), or by porting that library to Perl.

Depending on the content, often you can find an RSS feed.
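
As a rough sketch of how one might look for such a feed: many pages advertise it through a <link rel="alternate"> element in the <head>. The snippet below assumes HTML::TreeBuilder::XPath is installed; the sample page and its URLs are invented for illustration, and a site may well expose its feed some other way.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample page, invented for illustration only.
my $page = <<'HTML';
<html><head>
<title>Demo</title>
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="application/rss+xml" href="/index.rss">
<link rel="alternate" type="application/atom+xml" href="/atom.xml">
</head><body><p>Hello</p></body></html>
HTML

my $doc = HTML::TreeBuilder::XPath->new_from_content($page);

# Collect the href of every <link> that advertises an RSS or Atom feed.
my @feeds = map { $_->attr('href') } $doc->findnodes(
    '//link[@rel="alternate"]'
  . '[@type="application/rss+xml" or @type="application/atom+xml"]'
);

print "$_\n" for @feeds;
```

The hrefs are often relative, so they would still need to be resolved against the page URL (for example with URI->new_abs) before fetching.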

I distinctly remember reading a paper about HTML content extraction that did some calculation on the tree structure of the page and picked something like the element with the highest number of direct children of (I think) type p or div, but I can't find it anymore. That should be fairly simple to implement using XPath queries.
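
A minimal sketch of that heuristic as I remember it, not the paper's actual algorithm: use XPath to find every element that has at least one direct p or div child, then keep the one with the most such children. It assumes HTML::TreeBuilder::XPath is installed; the helper name densest_element and the sample page are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample page, invented for illustration only.
my $html = <<'HTML';
<html><head><title>Demo</title></head><body>
<div id="nav"><a href="/">Home</a></div>
<div id="main"><p>First.</p><p>Second.</p><p>Third.</p></div>
</body></html>
HTML

# Hypothetical helper: return the element with the most direct <p>/<div>
# children, plus that count. Ties keep the earliest element in document order.
sub densest_element {
    my ($tree) = @_;
    my ($best, $best_count) = (undef, 0);

    # XPath picks the candidates: any element with a direct p or div child.
    for my $node ($tree->findnodes('//*[p or div]')) {
        # Count only element children (text children are plain strings).
        my $count = grep { ref $_ && $_->tag =~ /\A(?:p|div)\z/ }
                    $node->content_list;
        ($best, $best_count) = ($node, $count) if $count > $best_count;
    }
    return ($best, $best_count);
}

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($best, $count) = densest_element($tree);

printf "%s#%s has %d direct p/div children\n",
    $best->tag, $best->attr('id') // '', $count;
print $best->as_text, "\n";
```

On real pages a raw child count is easily fooled by layout wrappers made of many div elements, so weighting by text length or counting only p would be an obvious refinement.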

Re^2: Module to extract text from HTML by bliako (Abbot) on Feb 27, 2024 at 21:06 UTC
What? No WWW::Mechanize::Chrome?

    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;

    my %default_mech_params = (
        headless   => 1,
        launch_arg => [
            '--window-size=600x800',
            '--password-store=basic', # do not ask me for stupid chrome account password
            '--disable-gpu',
            '--ignore-certificate-errors',
            '--disable-background-networking',
            '--disable-client-side-phishing-detection',
            '--disable-component-update',
            '--disable-hang-monitor',
            '--disable-save-password-bubble',
            '--disable-default-apps',
            '--disable-infobars',
            '--disable-popup-blocking',
        ],
    );

    my $mech = WWW::Mechanize::Chrome->new(%default_mech_params);
    $mech->get('https://perlmonks.org/?node_id=11157915');
    $mech->sleep(5);
    my $text_string = $mech->content( format => 'text' );
    print $text_string;

bw, bliako
Re^3: Module to extract text from HTML by parv (Parson) on Feb 27, 2024 at 21:22 UTC
What I like about the response are the various (interesting) `disable` options for `launch_arg`.
Re^4: Module to extract text from HTML by bliako (Abbot) on Feb 27, 2024 at 22:04 UTC
"`disable` options for `launch_arg`"

Curiously, `google-chrome --help` does not mention them. If they are not supported at the command line, they are surely supported via WWW::Mechanize::Chrome.
Re^2: Module to extract text from HTML by bliako (Abbot) on Feb 27, 2024 at 22:36 UTC
And here is the long-winded road of using the mech to save to PDF and then using `pdftotext` (Linux command line) to extract the text (all mixed up and good luck):

    ...
    my $pdf_data = $mech->content_as_pdf( format => 'A0' );
    open(my $fh, '>:raw', 'the.pdf') or die $!;
    print $fh $pdf_data;
    close $fh;
    `pdftotext 'the.pdf'`;

Note that 'A0' paper size ...
Re^3: Module to extract text from HTML by afoken (Chancellor) on Feb 28, 2024 at 19:41 UTC
"And here is the long-winded road of using the mech to save to PDF and then use pdftotext"

I'm still waiting for someone to suggest printing out, scanning back in, doing OCR, and having an AI fix the OCR errors. ;-)

Also, no traces of "just use a regex" so far. Which is really good.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^4: Module to extract text from HTML by bliako (Abbot) on Feb 29, 2024 at 17:35 UTC
I am surprised nobody has mentioned that this is an XY problem (X = I want to extract text from HTML, Y = I want to extract organisation description text from HTML). XY problem, XY solutions.