in reply to Regex to match first html tag previous to text
The most robust solution is to use HTML::Parser. Regexes are not recommended for parsing HTML, although they can be made to work, and trying is probably a good learning experience.
The . metacharacter should generally be your last choice in a regex. To stay within a tag I would suggest something like this (untested):
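For comparison, here is a minimal sketch of the HTML::Parser route, assuming the goal is to replace any link whose href starts with "email" with the word DELETED. The sub name strip_email_links and the sample markup are made up for illustration; the handler setup follows the version 3 API of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: replace <a href="email..."> ... </a> with DELETED,
# copying everything else through unchanged.
sub strip_email_links {
    my ($html) = @_;
    my $out      = '';
    my $skipping = 0;    # true while we are inside a link being removed

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if (!$skipping && $tag eq 'a'
                && defined $attr->{href} && $attr->{href} =~ /^\s*email/i) {
                $skipping = 1;
                $out .= 'DELETED';
                return;
            }
            $out .= $text unless $skipping;
        }, 'tagname, attr, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($skipping && $tag eq 'a') {
                $skipping = 0;    # closing tag of the removed link
                return;
            }
            $out .= $text unless $skipping;
        }, 'tagname, text' ],
        # Text, comments, declarations and so on are copied through as-is.
        default_h => [ sub { $out .= $_[0] unless $skipping }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

my $sample = q{<p>Contact <a class="x" href="email.cgi?id=1"><b>Bob</b></a> or see <a href="/faq">the FAQ</a>.</p>};
print strip_email_links($sample), "\n";
```

Unlike a pattern built on [^<]+, this copes with markup nested inside the link (the <b> above), because the parser tracks the tags for you.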
s!<a[^>]+href=['"]email[^>]+>[^<]+</a\>!DELETED!gi;
# which becomes this to deal with whitespace issues
s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;
The key thing we are doing is using negated character classes ([^>] and [^<]) around the > and < parts of the tags, so that we match everything up to the delimiter but still remain reliably in the tag. The endless \s* are required to deal with the relaxed way HTML deals with whitespace.
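As a quick check of the whitespace-tolerant pattern, here is a small sketch that wraps it in a sub and runs it on a made-up sample. The name delete_email_links and the sample markup are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the whitespace-tolerant substitution from the
# post and return the modified copy of the input.
sub delete_email_links {
    my ($html) = @_;
    $html =~ s!<\s*a[^>]+href\s*=\s*['"]\s*email[^>]+>[^<]+<\s*/a\s*>!DELETED!gi;
    return $html;
}

my $sample = q{<p>Contact <a class="x" href="email.cgi?id=1">Bob</a> or see <a href="/faq">the FAQ</a>.</p>};
print delete_email_links($sample), "\n";
# prints: <p>Contact DELETED or see <a href="/faq">the FAQ</a>.</p>
```

Because the link text is matched with [^<]+, a link that contains nested markup (say <b>Bob</b>) will not match and is left untouched. That is the price of staying inside the tag.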