Re: Module to extract text from HTML

Re^2: Module to extract text from HTML by bliako (Abbot) on Feb 29, 2024 at 15:17 UTC
Your post reminded me that there is also lynx (https://lynx.invisible-island.net/), a text-based web browser, and the CPAN module HTML::FormatText::Lynx, which spawns lynx and passes it an HTML filename or string.
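For anyone who wants to try the module route, here is a minimal sketch. It assumes HTML::FormatText::Lynx (part of the HTML-FormatExternal distribution) and the lynx binary are both installed; the helper name html_to_text, the sample markup and the --demo switch are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an HTML string to plain text via lynx.
# The module is loaded lazily so the script still compiles without it.
sub html_to_text {
    my ($html) = @_;
    require HTML::FormatText::Lynx;    # from the HTML-FormatExternal distribution
    return HTML::FormatText::Lynx->format_string($html);
}

# Illustrative input only.
my $sample = '<html><body><h1>Title</h1><p>Hello, <b>monks</b>.</p></body></html>';

# Run the conversion only on request, since it needs the module and lynx.
if (@ARGV && $ARGV[0] eq '--demo') {
    print html_to_text($sample);
}
```

The same class also offers format_file($filename) for HTML that already lives on disk.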
Re^3: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 15:40 UTC
You've inspired a reverse golf challenge: ignore all simple, portable solutions. What's the most convoluted way to achieve the goal? :)
Re^4: Module to extract text from HTML (Reverse Golf) by eyepopslikeamosquito (Archbishop) on Feb 29, 2024 at 22:40 UTC
"You've inspired a reverse golf challenge, ignore all simple, portable solutions, what's the most convoluted way to achieve the goal :)"

Interesting. As a keen student of the lighter side of programming culture, I'm unsure whether reverse golf warrants a new, distinct category or is just a sub-category of Obfu. Opinions welcome.

Historically, I believe Golf precedes Obfu, because John McCarthy's grad students played golf ("program bumming") in machine language on an IBM 704 computer in 1959 (APL golf was also popular in the 1960s), while the earliest reference to Obfu I'm aware of is 1972, when Messrs Woods and Lyon implemented INTERCAL on an IBM 360. The first International Obfuscated C Code Contest came much later, in 1984 (I see Larry Wall won this contest twice!).

See Also:
The Lighter Side of Perl Culture (Part III): Obfu
The Lighter Side of Perl Culture (Part IV): Golf
Rube Goldberg machine

Updated: Noted that Larry Wall won the International Obfuscated C Code Contest twice. Added link to Rube Goldberg machine.

👁️🍾👍🦟
Re^4: Module to extract text from HTML by bliako (Abbot) on Feb 29, 2024 at 17:32 UTC
Fair enough. But the problem of converting HTML to text can be solved with varying success, especially when heuristics are applied, so the more options the better. That's why I keep adding to the list, though the mech-to-PDF suggestion was more a joke than a solution.
Re^5: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 17:34 UTC
Re^6: Module to extract text from HTML by bliako (Abbot) on Feb 29, 2024 at 17:37 UTC
Re^4: Module to extract text from HTML by Danny (Chaplain) on Feb 29, 2024 at 15:45 UTC
my $text = `lynx -nolist -dump 'https://www.perlmonks.org/?node_id=11157915'`; :)
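If the URL comes from anywhere other than a hard-coded literal, the backtick form hands it to a shell, so quoting becomes a concern. A rough sketch of the same call using the list form of a piped open, which bypasses the shell; the sub names lynx_dump_command and lynx_dump are invented here, and lynx is assumed to be on the PATH:

```perl
use strict;
use warnings;

# Build the lynx argument list for dumping a page as plain text.
sub lynx_dump_command {
    my ($url) = @_;
    return ('lynx', '-nolist', '-dump', $url);
}

# Run lynx without going through a shell, so quotes or other shell
# metacharacters in $url are passed to lynx literally.
sub lynx_dump {
    my ($url) = @_;
    open(my $fh, '-|', lynx_dump_command($url))
        or die "Cannot run lynx: $!";
    local $/;    # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or die "lynx failed (exit status $?)";
    return $text;
}

# Usage: perl script.pl URL
print lynx_dump($ARGV[0]) if @ARGV;
```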
Re^5: Module to extract text from HTML by marto (Cardinal) on Feb 29, 2024 at 15:51 UTC