in reply to Re^2: Module to extract text from HTML
in thread Module to extract text from HTML

And here is the long-winded road of using the mech to save to PDF and then use pdftotext

I'm still waiting for someone to suggest printing it out, scanning it back in, running OCR, and having an AI fix the OCR errors. ;-)

Also, no traces of "just use a regex" so far. Which is really good.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^4: Module to extract text from HTML
by bliako (Abbot) on Feb 29, 2024 at 17:35 UTC

    I am surprised nobody has mentioned that this is an XY problem (X=I want to extract text from HTML, Y=I want to extract *organisation description text* from HTML). XY problem, XY solutions.