in reply to Code does Remove Html
But apart from that, you haven't really given enough information about how your script is handling the html data. Are you going line-by-line? (Are there newline characters in the data, and do you use your regex within a while (<>) {} kind of loop?) If so, then the regex will fail if there is a newline between a < and the next >.
Assuming that the html data has all been slurped into a single scalar variable, then your regex fails on tags like:
<font size="-1">
because it's not allowing for tags that contain a space character. Maybe what you were looking for was something simpler:
s/<[^>]+>/ /g;
(Note that some tags, like <p>, function as whitespace. Just deleting them outright might cause loss of some word boundaries, so replace them with spaces instead.)
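To make the word-boundary point concrete, here is a minimal sketch that compares deleting tags with replacing them by a space. The sample string and the variable names ($html, $deleted, $spaced) are made up for illustration, and the whitespace tidy-up at the end is an optional extra, not something from your script.

```perl
use strict;
use warnings;

# Hypothetical sample input, already held in a single scalar.
my $html = '<p>First paragraph.</p><p>Second <font size="-1">small</font> one.</p>';

# Deleting tags outright glues words together across tag boundaries.
( my $deleted = $html ) =~ s/<[^>]+>//g;

# Replacing each tag with a space preserves the word boundaries.
( my $spaced = $html ) =~ s/<[^>]+>/ /g;

# Optional tidy-up: collapse runs of whitespace and trim both ends.
$spaced =~ s/\s+/ /g;
$spaced =~ s/^\s+|\s+$//g;

print "$deleted\n";    # First paragraph.Second small one.
print "$spaced\n";     # First paragraph. Second small one.
```

This is still only a regex approximation: a literal ">" inside an attribute value or a comment would cut the match short.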
But again, the simpler regex still won't work if you're treating the data one line at a time and you happen to run into tags like:
<img
 src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg"
 alt="missing image" >
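A small sketch of the difference, assuming a tag that is split over several lines. The sample data and variable names are hypothetical, and the in-memory filehandle only stands in for whatever input your script really reads. With a real file you would typically slurp with something like my $html = do { local $/; <> };

```perl
use strict;
use warnings;

# Hypothetical input: one tag spread over three lines.
my $data = <<'END';
before <img
 src="/some/image.jpg"
 alt="missing image" > after
END

# Line-at-a-time: no single line holds both the "<" and its ">",
# so the substitution never matches and the tag survives.
my $by_line = '';
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    $line =~ s/<[^>]+>/ /g;
    $by_line .= $line;
}
close $fh;

# Slurped: [^>] also matches newlines, so the whole tag is replaced.
( my $slurped = $data ) =~ s/<[^>]+>/ /g;

print "line-by-line still contains the tag\n" if $by_line =~ /src=/;
print "slurped: $slurped";
```

The line-by-line result comes out identical to the input, while the slurped version loses the whole tag in one substitution.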