PerlMonks  

Looking for a module that strips an HTML tag and its associated 'TEXT'

by nysus (Parson)
on Jul 29, 2020 at 13:24 UTC ( [id://11119965]=perlquestion )

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a surprisingly hard time finding a module that strips out a given HTML tag along with the text it contains. HTML::TagFilter doesn't strip the text. HTML::Strip doesn't let you filter out only one kind of HTML tag (just the <p> tag, for instance); it looks like it might, but it doesn't. I tried. I'm looking for something nice and simple that doesn't rely on regexes.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;


Replies are listed 'Best First'.
Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by marto (Cardinal) on Jul 29, 2020 at 13:27 UTC
        quite the dependency chain you added just for this.
Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by jcb (Parson) on Jul 30, 2020 at 02:08 UTC

    I would suggest HTML::Parser and a simple state machine, but it will be more than two or three lines. You might even be able to play some tricks with the ignore_elements, ignore_tags, report_tags, and skipped_text features to make the XS code do most of the filtering work. Then the handler callbacks simply print or discard the text as needed, or you can have the XS code stuff a parse trace into an array and use that later in your program.
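
    The state-machine idea above could be sketched roughly as follows. This is only an illustration, not a tested drop-in: it assumes HTML::Parser is installed from CPAN, and the names strip_element and $depth are made up for this example. It also assumes the element to strip always has an explicit closing tag; an unclosed <p> would swallow everything after it.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return $html with every <$tag> element removed,
# including its start tag, its end tag, and everything in between.
sub strip_element {
    my ($html, $tag) = @_;
    $tag = lc $tag;      # HTML::Parser reports tag names in lower case by default
    my $depth = 0;       # state: > 0 while inside an element being stripped
    my $out   = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($name, $text) = @_;
            if ($name eq $tag) { $depth++; return }   # enter (or nest deeper)
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($name, $text) = @_;
            if ($name eq $tag) { $depth-- if $depth; return }   # leave
            $out .= $text unless $depth;
        }, 'tagname, text' ],
        # Text, comments, declarations, etc.: copy through unless stripping.
        default_h => [ sub { $out .= shift unless $depth }, 'text' ],
    );

    $p->parse($html);
    $p->eof;             # flush any buffered text
    return $out;
}

print strip_element('<div><p>drop me</p><b>keep me</b></div>', 'p'), "\n";
```

    The depth counter rather than a simple flag is there so that nested elements of the same name are handled. Whether ignore_elements can replace the handlers entirely, as suggested above, is worth checking against the HTML::Parser documentation.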

Re: Looking for a module that strips an HTML tag and its associated 'TEXT'
by perlfan (Vicar) on Jul 29, 2020 at 14:56 UTC
    See Bruce Gray - Refactoring and Readability: Crouching Regex, Hidden Structures for a nice intro to pure Perl/Raku options.

    I do not know if there is a CPAN module that interfaces with these external tools, but you may be interested in hxpipe, which the W3C provides. For example,

    hxpipe (1)           - convert XML to a format easier to parse with Perl or AWK
    
    Here's the full list, which includes tools that support selecting elements.
    cexport (1)          - create headerfile of exported declarations from a C file
    hxaddid (1)          - add ID's to selected elements
    hxcite (1)           - replace bibliographic references by hyperlinks
    hxcite-mkbib (1)     - expand references and create bibliography
    hxcopy (1)           - copy an HTML file while preserving relative links
    hxcount (1)          - count elements and attributes in HTML or XML files
    hxextract (1)        - extract selected elements
    hxclean (1)          - apply heuristics to correct an HTML file
    hxprune (1)          - remove marked elements from an HTML file
    hxincl (1)           - expand included HTML or XML files
    hxindex (1)          - create an alphabetically sorted index
    hxmkbib (1)          - create bibliography from a template
    hxmultitoc (1)       - create a table of contents for a set of HTML files
    hxname2id            - move some ID= or NAME= from A elements to their parents
    hxnormalize (1)      - pretty-print an HTML file
    hxnum (1)            - number section headings in an HTML file
    hxpipe (1)           - convert XML to a format easier to parse with Perl or AWK
    hxprintlinks (1)     - number links & add table of URLs at end of an HTML file
    hxremove (1)         - remove selected elements from an XML file
    hxtabletrans (1)     - transpose an HTML or XHTML table
    hxtoc (1)            - insert a table of contents in an HTML file
    hxuncdata (1)        - replace CDATA sections by character entities
    hxunent (1)          - replace HTML predefined character entities to UTF-8
    hxunpipe (1)         - convert output of pipe back to XML format
    hxunxmlns (1)        - replace "global names" by XML Namespace prefixes
    hxwls (1)            - list links in an HTML file
    hxxmlns (1)          - replace XML Namespace prefixes by "global names"
    asc2xml, xml2asc (1) - convert between UTF8 and &#nnn; entities
    hxref (1)            - generate cross-references
    hxselect (1)         - extract elements that match a (CSS) selector
    
    And FWIW, Sphinx also provides HTML stripping. Not sure how you'd use it, but it can be done when ingesting data for indexing.

      A non-Perl dependency makes this unappealing when pure-Perl modules can already achieve this.

        Without too much work, you could create an XS module that just uses this code directly. That way you don't need to exec.

        -Thomas
        "Excuse me for butting in, but I'm interrupt-driven..."

        Just sharing. I've been surprised at how many people find this incredibly useful. Also, there's nothing unappealing about Perl programs that call external programs unless those programs are also Perl. Not sure when it became discouraged to use Perl for one of the reasons it was originally created on Unix systems.

Node Type: perlquestion [id://11119965]
Approved by Corion
Front-paged by haukex