in reply to Module to extract text from HTML

Depending on exactly what text you want (including or excluding the stuff in the <head>), you might also get a solution working by running Mozilla's Readability library through one of the JavaScript engine bindings (JavaScript::QuickJS, JavaScript::Duktape), or by porting that library to Perl.

Depending on the content, often you can find an RSS feed.
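As a rough sketch of how you might look for such a feed: many sites advertise it in the <head> with a <link rel="alternate"> element. The sample HTML and the /feed.xml path below are made up for illustration, and using XML::LibXML is just my assumption of a convenient parser:

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample page; in practice this would be the HTML you fetched.
my $html = <<'HTML';
<html><head>
  <title>Example Org</title>
  <link rel="alternate" type="application/rss+xml" title="News" href="/feed.xml">
  <link rel="stylesheet" href="/style.css">
</head><body><p>Hello</p></body></html>
HTML

# recover => 2 makes libxml2 tolerate sloppy real-world HTML quietly
my $dom = XML::LibXML->load_html( string => $html, recover => 2 );

# Collect the href of every advertised RSS or Atom feed
my @feeds = map { $_->value } $dom->findnodes(
    '//link[@rel="alternate"]'
  . '[@type="application/rss+xml" or @type="application/atom+xml"]/@href'
);

print "$_\n" for @feeds;
```

The href is often relative, so you would still have to resolve it against the page URL (for example with URI->new_abs) before fetching it.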

I distinctly remember reading a paper about HTML content extraction that did some calculation on the tree structure of the page and used something like the element with the highest number of direct children of (I think) type p or div, but I can't find it anymore. Something like that should be fairly simple to implement using XPath queries.
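A minimal sketch of that heuristic, as I remember it and not necessarily as the paper did it: pick the element with the most direct <p> children and take the text of those paragraphs. The sample HTML is made up, XML::LibXML is my choice of parser, and counting only p (not div) is a simplification:

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample page; in practice this would be the HTML you fetched.
my $html = <<'HTML';
<html><head><title>Demo</title></head>
<body>
  <div id="nav"><p>Home</p></div>
  <div id="main">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <p>Third paragraph.</p>
  </div>
  <div id="footer"><p>Imprint</p></div>
</body></html>
HTML

# recover => 2 makes libxml2 tolerate sloppy real-world HTML quietly
my $dom = XML::LibXML->load_html( string => $html, recover => 2 );

# Look at every element that has at least one direct <p> child
# and remember the one with the highest count.
my ( $best, $best_count ) = ( undef, 0 );
for my $node ( $dom->findnodes('//*[p]') ) {
    my $count = $node->findvalue('count(p)');
    ( $best, $best_count ) = ( $node, $count ) if $count > $best_count;
}

# The text of the winning element's direct paragraphs
my $text = '';
if ($best) {
    $text = join "\n", map { $_->textContent } $best->findnodes('p');
}
print "$text\n";
```

To count div children as well, the XPath expressions could become '//*[p or div]' and 'count(p | div)', though that will probably favour wrapper elements such as <body> on many pages.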

Re^2: Module to extract text from HTML
by bliako (Abbot) on Feb 27, 2024 at 21:06 UTC

    What? No WWW::Mechanize::Chrome?

    use Log::Log4perl qw(:easy);
    use WWW::Mechanize::Chrome;

    my %default_mech_params = (
        headless   => 1,
        launch_arg => [
            '--window-size=600x800',
            '--password-store=basic', # do not ask me for stupid chrome account password
            '--disable-gpu',
            '--ignore-certificate-errors',
            '--disable-background-networking',
            '--disable-client-side-phishing-detection',
            '--disable-component-update',
            '--disable-hang-monitor',
            '--disable-save-password-bubble',
            '--disable-default-apps',
            '--disable-infobars',
            '--disable-popup-blocking',
        ],
    );

    my $mech = WWW::Mechanize::Chrome->new(%default_mech_params);
    $mech->get('https://perlmonks.org/?node_id=11157915');
    $mech->sleep(5);
    my $text_string = $mech->content( format => 'text' );
    print $text_string;

    bw, bliako

      What I like about the response are the various (interesting) disable options for launch_arg.

        disable options for launch_arg

        Curiously, google-chrome --help does not mention them. If they are not supported at the command line they are surely supported via WWW::Mechanize::Chrome.

Re^2: Module to extract text from HTML
by bliako (Abbot) on Feb 27, 2024 at 22:36 UTC

    And here is the long-winded road of using the mech to save to PDF and then use pdftotext (Linux command line) to extract the text (all mixed up and good luck):

    ...
    my $pdf_data = $mech->content_as_pdf( format => 'A0' );
    open(my $fh, '>:raw', 'the.pdf') or die $!;
    print $fh $pdf_data;
    close $fh;
    `pdftotext 'the.pdf'`;

    Note that 'A0' paper size ...

      And here is the long-winded road of using the mech to save to PDF and then use pdftotext

      I'm still waiting for someone to suggest printing out, scanning back in, doing OCR, and having an AI fix the OCR errors. ;-)

      Also, no traces of "just use a regex" so far. Which is really good.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        I am surprised nobody has mentioned that this is an XY problem (X=I want to extract text from html, Y=I want to extract *organisation description text* from html). XY problem, XY solutions.