Re: Regex to match first html tag previous to text

The most robust solution is to use the HTML::Parser module. Regexes are not recommended for parsing HTML, although they can be made to work for simple cases, and trying is probably a good learning experience.
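For comparison, here is a rough sketch of the parser route. It assumes HTML::Parser (API version 3) is installed from CPAN. The sample input, the "href starts with email" test, and the variable names are made up for illustration and should be adapted to the real data:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical input; the href values are invented for illustration.
my $html = '<p><a href="email.cgi?to=bob">Bob</a> and '
         . '<a href="http://example.com/">the site</a></p>';

my $out     = '';   # the rebuilt document
my $in_link = 0;    # true while we are inside a link being deleted

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tags: begin skipping when an <a> has an href starting with "email".
    start_h => [ sub {
        my ($tag, $attr, $text) = @_;
        if (!$in_link && $tag eq 'a' && ($attr->{href} // '') =~ /^\s*email/i) {
            $in_link = 1;
            $out .= 'DELETED';
            return;
        }
        $out .= $text unless $in_link;
    }, 'tagname, attr, text' ],
    # End tags: stop skipping at the matching </a>.
    end_h => [ sub {
        my ($tag, $text) = @_;
        if ($in_link && $tag eq 'a') { $in_link = 0; return; }
        $out .= $text unless $in_link;
    }, 'tagname, text' ],
    # Everything else (text, comments, ...) is copied through unchanged.
    default_h => [ sub {
        my ($text) = @_;
        $out .= $text if defined $text && !$in_link;
    }, 'text' ],
);

$parser->parse($html);
$parser->eof;

print "$out\n";
```

The parser lowercases tag and attribute names by default and copes with quoting and whitespace itself, so none of the \s* gymnastics are needed.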

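The . metacharacter (particularly as .* or .+) should generally be your last choice in a regex, because it will happily match past the end of the tag you care about. To stay within a tag I would suggest something like this (untested):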

s!<a[^>]+href=['"]email[^>]+>[^<]+</a>!DELETED!gi;

# which becomes this to deal with whitespace issues

s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;

The key thing we are doing is using negated character classes, [^>] inside the opening tag and [^<] for the link text, so that we match whatever is there but can never run past the end of the tag or into the next one. The endless \s* are required because HTML is so relaxed about where whitespace may appear.
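A quick way to check this is to run the whitespace-tolerant substitution over a small sample. The input below is invented for illustration (the real href values were not given), so treat it as a sketch rather than a proper test suite:

```perl
use strict;
use warnings;

# Hypothetical sample: two "email" links (one with odd spacing and case)
# and one unrelated link that must be left alone.
my $html = <<'HTML';
<p>Contact: <a href="email.cgi?to=bob">Bob</a> or
< A class='x' HREF = 'email.cgi?to=amy' >Amy< /a >.
See also <a href="http://example.com/">the site</a>.</p>
HTML

# Work on a copy so the original string stays available for comparison.
(my $cleaned = $html) =~
    s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;

print $cleaned;
```

Because [^>] cannot cross a closing angle bracket, the third link is never touched: its href does not start with "email", and the match cannot wander out of that tag to find one that does.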
