Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by marto (Cardinal) on Jul 29, 2020 at 13:27 UTC
quite the dependency chain you added just for this.
Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by jcb (Parson) on Jul 30, 2020 at 02:08 UTC
I would suggest HTML::Parser and a simple state machine, but it will be more than two or three lines. You might even be able to play some tricks with the ignore_elements, ignore_tags, report_tags, and skipped_text features to make the XS code do most of the filtering work. Then the handler callbacks simply print or discard the text as needed, or you can have the XS code stuff a parse trace into an array and use that later in your program.
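For what it's worth, here is a rough sketch of the state-machine idea described above. It is only one way to do it, not necessarily what the poster had in mind. The helper name `strip_element` and the sample markup are made up for illustration; the only real API used is HTML::Parser's version 3 handler interface (`start_h`, `end_h`, `default_h` with the `tagname` and `text` argspecs). The "state" is just a depth counter: while it is non-zero we are inside the unwanted element and everything is discarded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return $html with every <$tag>...</$tag> element
# removed, including the text and any markup nested inside it.
sub strip_element {
    my ($html, $tag) = @_;
    $tag = lc $tag;      # HTML::Parser reports tag names in lower case by default
    my $depth = 0;       # state: how many open $tag elements we are inside
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($name, $text) = @_;
            if ($name eq $tag) { $depth++; return }   # enter the skipped region
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($name, $text) = @_;
            if ($name eq $tag) { $depth-- if $depth; return }   # leave it
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        # Everything else (text, comments, declarations, ...) is copied
        # through verbatim unless we are inside the element being stripped.
        default_h => [ sub {
            my ($text) = @_;
            $out .= $text if defined $text && !$depth;
        }, 'text' ],
    );

    $p->parse($html);
    $p->eof;             # flush anything the parser is still buffering
    return $out;
}

print strip_element(
    '<p>Keep <span class="x">drop <b>this</b></span> too</p>', 'span'
), "\n";
# prints: <p>Keep  too</p>
```

The depth counter assumes the stripped element is properly closed; an unclosed tag would swallow the rest of the document. Calling `$p->ignore_elements('span')` and keeping only the `default_h` handler may let the XS code do the same filtering, as suggested above, but check the HTML::Parser documentation for its nesting caveats before relying on it.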
Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by perlfan (Vicar) on Jul 29, 2020 at 14:56 UTC
hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
Here's the full list, which includes tools that support selecting elements.
cexport (1) - create headerfile of exported declarations from a C file
hxaddid (1) - add ID's to selected elements
hxcite (1) - replace bibliographic references by hyperlinks
hxcite-mkbib (1) - expand references and create bibliography
hxcopy (1) - copy an HTML file while preserving relative links
hxcount (1) - count elements and attributes in HTML or XML files
hxextract (1) - extract selected elements
hxclean (1) - apply heuristics to correct an HTML file
hxprune (1) - remove marked elements from an HTML file
hxincl (1) - expand included HTML or XML files
hxindex (1) - create an alphabetically sorted index
hxmkbib (1) - create bibliography from a template
hxmultitoc (1) - create a table of contents for a set of HTML files
hxname2id (1) - move some ID= or NAME= from A elements to their parents
hxnormalize (1) - pretty-print an HTML file
hxnum (1) - number section headings in an HTML file
hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
hxprintlinks (1) - number links & add table of URLs at end of an HTML file
hxremove (1) - remove selected elements from an XML file
hxtabletrans (1) - transpose an HTML or XHTML table
hxtoc (1) - insert a table of contents in an HTML file
hxuncdata (1) - replace CDATA sections by character entities
hxunent (1) - replace HTML predefined character entities to UTF-8
hxunpipe (1) - convert output of pipe back to XML format
hxunxmlns (1) - replace "global names" by XML Namespace prefixes
hxwls (1) - list links in an HTML file
hxxmlns (1) - replace XML Namespace prefixes by "global names"
asc2xml, xml2asc (1) - convert between UTF8 and &#nnn; entities
hxref (1) - generate cross-references
hxselect (1) - extract elements that match a (CSS) selector
And FWIW, Sphinx also provides HTML stripping. Not sure how you'd use it, but it can be done when ingesting data for indexing.
Just sharing. I've been surprised at how many people find this incredibly useful. Also, there's nothing unappealing about Perl programs that call external programs unless those programs are also Perl. Not sure when it became discouraged to use Perl for one of the reasons it was originally created on Unix systems.