Re: Code does Remove Html

If you are trying to use a regex to remove tags from html data, the data is not "being parsed" in the proper sense of the term -- parsing is what HTML::TokeParser does.
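For comparison, here is a rough sketch of what pulling the text out with a real parser might look like. It assumes HTML::TokeParser is installed from CPAN, and the sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module, part of the HTML-Parser distribution

# Hypothetical sample input.
my $html = '<p>Some <font size="-1">small</font> text</p>';

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot set up parser";

my $text = '';
while (my $token = $parser->get_token) {
    # Each token is an array ref; element 0 is the token type.
    # "T" tokens are plain text, with the text itself in element 1.
    $text .= $token->[1] if $token->[0] eq 'T';
}

print "$text\n";
```

The parser decides where tags begin and end, so you never have to describe a tag with a pattern yourself.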

But apart from that, you haven't really given enough information about how your script is handling the html data. Are you going line-by-line? (Are there newline characters in the data, and do you use your regex within a while (<>) {} kind of loop?) If so, then the regex will fail if there is a newline between a < and the next >.

Assuming that the html data has all been slurped into a single scalar variable, then your regex fails on tags like:

<font size="-1">

because it's not allowing for tags that contain a space character. Maybe what you were looking for was something simpler:

s/<[^>]+>/ /g;

(Note that some tags, like <p>, act as whitespace: they separate words even when there is no space character around them in the source. Deleting them outright can run adjacent words together, so replace each tag with a space instead.)
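A small sketch of the difference the replacement string makes (the sample string and variable names here are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample input: <p> is the only thing separating the two words.
my $html = 'first paragraph<p>second paragraph';

# Deleting tags outright fuses the neighbouring words.
(my $deleted = $html) =~ s/<[^>]+>//g;

# Replacing each tag with a space keeps the word boundary.
(my $spaced = $html) =~ s/<[^>]+>/ /g;

print "$deleted\n";   # first paragraphsecond paragraph
print "$spaced\n";    # first paragraph second paragraph
```

If the resulting runs of spaces bother you, a follow-up tr/ //s will squeeze them down to single spaces.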

But again, the simpler regex still won't work if you're treating the data one line at a time and you happen to run into tags like:

<img
 src="/super/long/path/string/that/the/author/puts/on/a/separate/line.jpg"
 alt="missing image"
>
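To make the line-at-a-time problem concrete, here is a minimal sketch that runs the same substitution both ways on a made-up multi-line tag. The sample data and variable names are assumptions for the demo, not taken from your script:

```perl
use strict;
use warnings;

# Hypothetical sample data: one tag spread over several lines.
my $html = <<'END_HTML';
before <img
 src="/some/image.jpg"
 alt="missing image"
> after
END_HTML

# Line at a time: no single line holds both the < and its closing >,
# so the substitution never matches and the pieces of the tag survive.
my $by_line = '';
for my $line (split /^/, $html) {
    $line =~ s/<[^>]+>/ /g;
    $by_line .= $line;
}

# Whole document at once: [^>] matches newlines too, so the tag goes away.
(my $slurped = $html) =~ s/<[^>]+>/ /g;

print "line by line:\n$by_line\n";
print "slurped:\n$slurped\n";
```

In a real script, one common way to get the whole input into a single scalar is my $html = do { local $/; <> }; before applying the substitution.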
