in reply to Module to extract text from HTML

If you can use external programs instead of just Perl modules, try html2text. It does exactly what you want. Be warned, though: there are at least two different programs with that name, and they take different options.

Re^2: Module to extract text from HTML
by bliako (Abbot) on Feb 29, 2024 at 15:17 UTC

    Your post reminded me that there is also lynx (https://lynx.invisible-island.net/), a text-based web browser, and the CPAN module HTML::FormatText::Lynx, which spawns lynx and passes it an HTML filename or string.

      You've inspired a reverse golf challenge: ignore all simple, portable solutions. What's the most convoluted way to achieve the goal? :)

        Fair enough. But converting HTML to text can be solved with varying degrees of success, especially when heuristics are applied, so the more options the better. That's why I keep adding to the list, though the mech-to-pdf suggestion was more a joke than a solution.

        my $text = `lynx -nolist -dump 'https://www.perlmonks.org/?node_id=11157915'` :)
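For anything beyond a one-liner, the backtick call above can be sketched as a small helper that runs the command without going through the shell, so a URL containing quotes or other shell metacharacters is passed through untouched. This is only a sketch under assumptions: the name dump_text and its optional command-override argument are hypothetical (not from the thread or from any module), and it assumes lynx is installed and on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external HTML-to-text command and return its output.
# Defaults to the lynx invocation from the one-liner above; a different command
# can be passed after the URL (an assumption added for illustration).
sub dump_text {
    my ($url, @cmd) = @_;
    @cmd = ('lynx', '-nolist', '-dump') unless @cmd;

    # List-form pipe open: the arguments go straight to the program, no shell involved.
    open(my $fh, '-|', @cmd, $url)
        or die "cannot run $cmd[0]: $!\n";

    # Slurp the whole output in one read.
    my $text = do { local $/; <$fh> };

    # close() returns false if the command exited with a non-zero status.
    close($fh)
        or die "$cmd[0] failed (exit status " . ($? >> 8) . ")\n";

    return $text;
}

# Only runs when a URL is given on the command line.
print dump_text($ARGV[0]) if @ARGV;
```

Saved as, say, dump.pl, it would be called as: perl dump.pl 'https://www.perlmonks.org/?node_id=11157915'. The command override is there mainly so that another converter can be swapped in without touching the rest of the code.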