$clean = HTML::FormatText->new->format(parse_html($html)) if ($html =~ m/<[^>]+>/);
Basically, it only strips out the HTML if the text has some semblance of an HTML tag. To check that this worked as anticipated, I did some profiling and benchmarking, and found it sped up the script on documents that had NO HTML from 2 minutes down to 30 seconds. (Sorry, this was a few months ago, and I don't have the profile results anymore.) I also ran it on some files that were mixed, and found it sped things up from 1 minute to 45 seconds. Not as huge an improvement as the other, but it works.
From this I learned a great lesson: why munge data when it doesn't need it, and in this case, why load up HTML::FormatText and HTML::Parse for it?
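To make that lesson concrete, here is a minimal sketch of the same guard that also defers loading the two modules until a document actually needs them. The helper name strip_html_if_needed is made up for illustration, and swapping `use` for a run-time `require` is my assumption, not something the original script did.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns plain text,
# only touching the HTML modules when the input looks like it has a tag.
sub strip_html_if_needed {
    my ($text) = @_;

    # Cheap check first: no tag-like substring means nothing to strip.
    return $text unless $text =~ m/<[^>]+>/;

    # Assumption: load the heavy modules at run time, on first real use,
    # instead of with `use` at the top of the script.
    require HTML::Parse;
    require HTML::FormatText;

    my $tree  = HTML::Parse::parse_html($text);
    my $clean = HTML::FormatText->new->format($tree);
    $tree->delete;    # release the parse tree
    return $clean;
}

my $plain = strip_html_if_needed("just plain text, no markup");
print "HTML::FormatText loaded: ",
    ( exists $INC{'HTML/FormatText.pm'} ? "yes" : "no" ), "\n";
```

Whether the deferred `require` buys anything beyond the regex guard depends on how many of your documents are plain text, so it is worth benchmarking on your own data.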