Re: Module to extract text from HTML

Re^2: Module to extract text from HTML by marto (Cardinal) on Feb 27, 2024 at 11:46 UTC
Mojo::DOM is a parser which makes this trivial. However, I get the impression from the question that it's less about selecting particular parts of the page ('just extracting the p tags which is not quite good enough') and more about 'all' of the text.
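For the 'all of the text' case, a minimal sketch with Mojo::DOM could look like the one below. The sample HTML is made up for illustration, and dropping `script` and `style` elements before collecting text is an assumption about what should count as page text, not something stated in the thread.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample page, for illustration only.
my $html = <<'HTML';
<html>
  <head><style>p { color: red }</style></head>
  <body>
    <h1>About us</h1>
    <p>We support <b>local</b> families.</p>
    <script>var x = 1;</script>
  </body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

# Assumption: script and style contents are not wanted as text.
$dom->find('script, style')->map('remove');

# all_text gathers the text of every descendant node.
my $text = $dom->all_text;

# Collapse the whitespace runs left over from the markup.
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";
```

Real pages will probably need more filtering than this (navigation, footers, cookie banners), which is where selectors come back in.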
Re^3: Module to extract text from HTML by parv (Parson) on Feb 27, 2024 at 18:47 UTC
(hmm) Right. Yeah, I missed that. (edited) Or maybe not. So `qx[w3m url]`? Anyway, I took a superficial look at the dependencies of Mojolicious and did not see anything like an external parser. Does it implement the parsing itself? Yes, yes it does.
Re^4: Module to extract text from HTML by marto (Cardinal) on Feb 28, 2024 at 17:06 UTC
It's fantastic: powerful CSS selectors, usable standalone or on Mojo::UserAgent results. See also ojo - Fun one-liners with Mojo, Super Search for some interesting use cases, and, if you do web development, Mojolicious::Lite/Mojolicious.
Re^3: Module to extract text from HTML by Bod (Parson) on Feb 27, 2024 at 19:45 UTC
"I get the impression from the question that it's less about selecting particular parts of the page"
We already hold the website of our customers (typically UK charities). We want them to complete a short section about their organisation. This is used to construct prompts for AI tools around our site that they use to streamline their workload. I am trying to make it easier for them to complete the description of their organisation by pulling text from their own website. This will give them something to work with instead of having to begin with a blank canvas (or contenteditable div).
Re^4: Module to extract text from HTML by bliako (Abbot) on Feb 28, 2024 at 14:15 UTC
If I understood correctly that you are in control of the websites and the formatting of their content, perhaps you could add some tags to the content by means of HTML comments or, better, custom attributes on the HTML tags, e.g. `<p data-purpose="description" data-index="1">blah blah</p>`, and then you just reconstruct the text content from the HTML.
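A rough sketch of the reconstruction step, assuming Mojo::DOM does the parsing (my choice here, any parser with attribute selectors should do). The markup and its wording are invented. Only the `data-purpose` and `data-index` attribute names come from the suggestion above.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up markup; paragraphs are deliberately out of order.
my $html = <<'HTML';
<div>
  <p data-purpose="description" data-index="2">We run three food banks.</p>
  <p>Cookie notice, not wanted.</p>
  <p data-purpose="description" data-index="1">Founded in 1998.</p>
</div>
HTML

my $description = Mojo::DOM->new($html)
  ->find('p[data-purpose="description"]')                            # tagged paragraphs only
  ->sort(sub { $a->attr('data-index') <=> $b->attr('data-index') })  # restore intended order
  ->map('all_text')                                                  # text of each paragraph
  ->join(' ')
  ->to_string;

print "$description\n";
```

Untagged paragraphs are simply ignored, so the site owner decides what ends up in the description.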
Re^5: Module to extract text from HTML by Bod (Parson) on Mar 01, 2024 at 15:47 UTC