Looking for a module that strips an HTML tag and its associated 'TEXT'

nysus has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Looking for a module that strips an HTML tag and its associated 'TEXT' by marto (Cardinal) on Jul 29, 2020 at 13:27 UTC
Sounds like you want Mojo::DOM. See the threads "Mojo::DOM parsing question" and "Calling sub-routine in regex", or use Super Search for more. Otherwise, post some example data and what you want to get out of it.
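As a rough illustration of the Mojo::DOM suggestion, here is a minimal sketch that removes an element together with the text inside it. The sample markup and the `span.ad` selector are made up for the example, since the thread never shows the real data.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input; the original question did not include any data.
my $html = '<div><p>Keep this.</p><span class="ad">Drop this text too.</span></div>';

my $dom = Mojo::DOM->new($html);

# remove() deletes each matched element along with all of its content.
$dom->find('span.ad')->each(sub { $_->remove });

my $mojo_out = $dom->to_string;
print "$mojo_out\n";
```

Any CSS selector that Mojo::DOM supports could stand in for `span.ad`, so the same pattern should cover removal by tag name, class, or attribute.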
Re^2: Looking for a module that strips an HTML tag and its associated 'TEXT' by nysus (Parson) on Jul 29, 2020 at 13:31 UTC
I think I just found what I need: https://metacpan.org/pod/HTML::Restrict. Thanks.
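For readers wondering how HTML::Restrict might cover the "tag and its text" case, here is a small sketch using its `strip_enclosed_content` option. The input string and the choice of `b` as the tag to drop are assumptions for illustration, not something nysus posted.

```perl
use strict;
use warnings;
use HTML::Restrict;

# Hypothetical sample input.
my $html = '<p>Keep this. <b>Drop this text too.</b></p>';

# With no rules given, HTML::Restrict strips every tag. Tags listed in
# strip_enclosed_content also lose the text between their open and close tags.
my $hr = HTML::Restrict->new(
    strip_enclosed_content => [ 'script', 'style', 'b' ],
);

my $restrict_out = $hr->process($html);
print "$restrict_out\n";
```

Note that this also strips all the other tags, which may or may not be wanted. Passing a `rules` hash would be the way to let selected tags through.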
Re^3: Looking for a module that strips an HTML tag and its associated 'TEXT' by Anonymous Monk on Jul 29, 2020 at 13:48 UTC
Quite the dependency chain you added just for this.
Re^4: Looking for a module that strips an HTML tag and its associated 'TEXT' by nysus (Parson) on Jul 29, 2020 at 13:56 UTC
Re^5: Looking for a module that strips an HTML tag and its associated 'TEXT' by marto (Cardinal) on Jul 29, 2020 at 14:02 UTC
Re: Looking for a module that strips an HTML tag and its associated 'TEXT' by jcb (Parson) on Jul 30, 2020 at 02:08 UTC
I would suggest HTML::Parser and a simple state machine, but it will be more than two or three lines. You might even be able to play some tricks with the `ignore_elements`, `ignore_tags`, `report_tags`, and `skipped_text` features to make the XS code do most of the filtering work. Then the handler callbacks simply print or discard the text as needed, or you can have the XS code stuff a parse trace into an array and use that later in your program.
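One way the `ignore_elements` trick could look is sketched below. This is not the full state machine described above, just the simplest variant: a single default handler echoes the source text of every event, while the parser itself suppresses the ignored element and everything inside it. The input and the `b` tag are illustrative assumptions.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input.
my $html = '<p>Keep this. <b>Drop this text too.</b></p>';

my $parser_out = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # The 'text' argspec hands the callback the original source text of
    # each event (tags included), so appending it reproduces the document.
    default_h => [
        sub { my ($text) = @_; $parser_out .= $text if defined $text },
        'text',
    ],
);

# Suppress <b>...</b> entirely: start tag, end tag, and all events between.
$parser->ignore_elements('b');

$parser->parse($html);
$parser->eof;

print "$parser_out\n";
```

The HTML::Parser documentation warns that ignored elements need to be properly nested, otherwise text may be skipped through to the end of the input, so this is best suited to reasonably clean markup.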
Re: Looking for a module that strips an HTML tag and its associated 'TEXT' by perlfan (Vicar) on Jul 29, 2020 at 14:56 UTC
See Bruce Gray - Refactoring and Readability: Crouching Regex, Hidden Structures for a nice intro to pure Perl/Raku options. I do not know if there is a CPAN module that interfaces with these external tools, but you may be interested in `hxpipe` that W3C provides. For example,

hxpipe (1) - convert XML to a format easier to parse with Perl or AWK

Here's the full list, which includes tools that support selecting elements.

cexport (1) - create headerfile of exported declarations from a C file
hxaddid (1) - add ID's to selected elements
hxcite (1) - replace bibliographic references by hyperlinks
hxcite-mkbib (1) - expand references and create bibliography
hxcopy (1) - copy an HTML file while preserving relative links
hxcount (1) - count elements and attributes in HTML or XML files
hxextract (1) - extract selected elements
hxclean (1) - apply heuristics to correct an HTML file
hxprune (1) - remove marked elements from an HTML file
hxincl (1) - expand included HTML or XML files
hxindex (1) - create an alphabetically sorted index
hxmkbib (1) - create bibliography from a template
hxmultitoc (1) - create a table of contents for a set of HTML files
hxname2id - move some ID= or NAME= from A elements to their parents
hxnormalize (1) - pretty-print an HTML file
hxnum (1) - number section headings in an HTML file
hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
hxprintlinks (1) - number links & add table of URLs at end of an HTML file
hxremove (1) - remove selected elements from an XML file
hxtabletrans (1) - transpose an HTML or XHTML table
hxtoc (1) - insert a table of contents in an HTML file
hxuncdata (1) - replace CDATA sections by character entities
hxunent (1) - replace HTML predefined character entities to UTF-8
hxunpipe (1) - convert output of pipe back to XML format
hxunxmlns (1) - replace "global names" by XML Namespace prefixes
hxwls (1) - list links in an HTML file
hxxmlns (1) - replace XML Namespace prefixes by "global names"
asc2xml, xml2asc (1) - convert between UTF8 and &#nnn; entities
hxref (1) - generate cross-references
hxselect (1) - extract elements that match a (CSS) selector

And FWIW, Sphinx also provides HTML stripping. Not sure how you'd use it, but it can be done when ingesting data for indexing.
Re^2: Looking for a module that strips an HTML tag and its associated 'TEXT' by marto (Cardinal) on Jul 29, 2020 at 15:02 UTC
A non-Perl dependency makes this unappealing when pure-Perl modules can already achieve this.
Re^3: Looking for a module that strips an HTML tag and its associated 'TEXT' by thomas895 (Deacon) on Jul 30, 2020 at 02:01 UTC
Without too much work, you could create an XS module that just uses this code directly. That way you don't need to exec.
-Thomas
Re^3: Looking for a module that strips an HTML tag and its associated 'TEXT' by perlfan (Vicar) on Jul 29, 2020 at 15:04 UTC
Just sharing. I've been surprised at how many people find this incredibly useful. Also, there's nothing unappealing about Perl programs that call external programs unless those programs are also Perl. Not sure when it became discouraged to use Perl for one of the purposes it was originally created for on Unix systems.
Re^4: Looking for a module that strips an HTML tag and its associated 'TEXT' by hippo (Bishop) on Jul 29, 2020 at 15:18 UTC
Re^5: Looking for a module that strips an HTML tag and its associated 'TEXT' by perlfan (Vicar) on Jul 29, 2020 at 16:06 UTC

