in reply to Re: Module to extract text from HTML
in thread Module to extract text from HTML

Your post reminded me that there is also lynx (https://lynx.invisible-island.net/), a text-based web browser, and the CPAN module HTML::FormatText::Lynx, which spawns lynx and passes it an HTML filename or string.
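A minimal sketch of how the module might be called, assuming HTML::FormatText::Lynx (from the HTML-FormatExternal distribution) and the lynx program are both installed. The wrapper name html_to_text_via_lynx and the sample HTML are made up for illustration; only the format_string class method comes from the module.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of the module): returns the text rendering,
# or undef if the module cannot be loaded or format_string() throws.
sub html_to_text_via_lynx {
    my ($html) = @_;

    # Load the module at runtime so a missing install is not fatal.
    return unless eval { require HTML::FormatText::Lynx; 1 };

    # format_string() hands the HTML to an external lynx process
    # and returns the text that lynx produces.
    my $text = eval { HTML::FormatText::Lynx->format_string($html) };
    return $text;
}

# Sample input invented for this example.
my $text = html_to_text_via_lynx('<h1>Title</h1><p>Some <b>bold</b> text.</p>');
print defined $text ? $text : "HTML::FormatText::Lynx or lynx is not available\n";
```

For a file on disk, the module also offers format_file, which takes a filename instead of a string.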

Re^3: Module to extract text from HTML
by marto (Cardinal) on Feb 29, 2024 at 15:40 UTC

    You've inspired a reverse golf challenge: ignore all simple, portable solutions. What's the most convoluted way to achieve the goal? :)

      Fair enough. But the problem of converting HTML to text can be solved with varying degrees of success, especially when heuristics are applied, so the more options the better. That's why I keep adding to the list, though the mech-to-pdf suggestion was more of a joke than a solution.

        Indeed, and my comment wasn't intended as a criticism, but rather as an idea for an inverse golf, Rube Goldberg approach to problems. Just as code golfing is an exercise, so is writing a needlessly convoluted solution that still produces a suitable result.

      my $text = `lynx -nolist -dump 'https://www.perlmonks.org/?node_id=11157915'`; :)

        That's far too efficient. The purpose of such a challenge is to deliberately make it convoluted; think Rube Goldberg machine. In real terms, not everyone has lynx, and not everyone can install it on their web host.

        Update: added link.
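Since not everyone has lynx, the backtick one-liner above could be guarded by a check for the program before shelling out. The following is only a sketch under stated assumptions: a Unix-like system, and the helper names find_program and lynx_dump are hypothetical. It uses the list form of a pipe open, so the URL is passed to lynx directly and needs no shell quoting.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full path of $name if an executable
# file of that name exists in a PATH directory, otherwise undef.
sub find_program {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Hypothetical helper: dump a URL as plain text via lynx.
# Returns undef if lynx is not found, cannot be started, or exits non-zero.
sub lynx_dump {
    my ($url) = @_;
    my $lynx = find_program('lynx') or return;

    # List-form pipe open: the arguments bypass the shell entirely.
    open(my $fh, '-|', $lynx, '-nolist', '-dump', $url) or return;
    local $/;                  # slurp the whole output at once
    my $text = <$fh>;
    close($fh) or return;      # close fails if lynx exited with an error
    return $text;
}

# Usage (performs a network request, so it is left commented out):
# my $text = lynx_dump('https://www.perlmonks.org/?node_id=11157915');
# print defined $text ? $text : "lynx not available, use a pure-Perl module\n";
```

When lynx_dump returns undef, the caller can fall back to one of the portable, pure-Perl options mentioned elsewhere in the thread.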