Okay, I had a program for work that recursed down a directory, read in each file, grabbed some info from it, did a few things (such as a word count, which needed the HTML gone), and output some SQL. Deciding that there is no really good regex method for stripping out HTML (one that would let me sleep at night), I went with what is listed in the Cookbook, with a slight modification, as listed below:
$clean = HTML::FormatText->new->format(parse_html($html)) if ($html =~ m/<[^>]+>/);

Basically, it only strips out the HTML if the text has some semblance of an HTML tag. To check that this worked as anticipated, I did some profiling and benchmarking, and found it sped up the script on documents that had NO HTML from 2 minutes to 30 seconds. (Sorry, this was a few months ago, and I don't have the profile results anymore.) I also ran it on some files that were mixed, and found it sped things up from 1 minute to 45 seconds. Not as huge an improvement as the other, but it works.

I then learned a great lesson: why munge data when it is not needed, or, in this case, why load up HTML::FormatText and HTML::Parse at all?
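
To make that lesson concrete, here is a minimal sketch of how the module loading itself could be deferred until a tag-like pattern actually shows up. The helper name strip_html_if_needed is hypothetical (it is not from the Cookbook or from my original script), and the sketch assumes HTML::Parse and HTML::FormatText are installed for the case where HTML is found.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text unchanged when it contains
# nothing that looks like a tag, and only loads the HTML modules
# when there is actually something to strip.
sub strip_html_if_needed {
    my ($text) = @_;

    # Cheap pre-check: no "<...>" means there is nothing to strip.
    return $text unless $text =~ m/<[^>]+>/;

    # Deferred loading: require only does the work the first time.
    require HTML::Parse;
    require HTML::FormatText;

    # require does not import parse_html, so call it fully qualified.
    my $tree = HTML::Parse::parse_html($text);
    return HTML::FormatText->new->format($tree);
}
```

Unlike the one-liner above, this returns the original text when no tag is found, so the caller always gets something to count words on. Whether deferring the require saves anything noticeable depends on how many runs never see any HTML.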


In reply to Stripping HTML by lshatzer
