in reply to Re^2: Matching ampersands that are NOT part of an HTML entity?
in thread Matching ampersands that are NOT part of an HTML entity?
Good point about "\d". As I stated, I assumed the original pattern was correct. I did so because I didn't look up what a valid entity could be.
As for the optional ";", the rules are hidden in the SGML spec. Perhaps it would make sense to add the ";" if it's missing (using s/(&[a-zA-Z]++)(?!;)/$1;/g;).
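A quick sketch of what that substitution would do, assuming Perl 5.10+ (for the possessive quantifier). The sub name add_missing_semicolons and the sample string are mine, for illustration only. Note that it appends ";" after any "&" plus a run of letters, without checking that the name is a known entity.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers (++) need 5.10

# Hypothetical helper: append ";" to any &name that is not already
# followed by one. It does not check that the name is a real entity.
sub add_missing_semicolons {
    my ($text) = @_;
    $text =~ s/(&[a-zA-Z]++)(?!;)/$1;/g;
    return $text;
}

print add_missing_semicolons('&lt b &gt; &amp'), "\n";    # &lt; b &gt; &amp;
```

Because the quantifier is possessive, "&gt;" is left alone: the engine cannot give back the "t" to make the lookahead succeed on a shorter name.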
Going by your description of what is valid, using \b is incorrect. \b is defined in terms of \w, and \w matches more than letters (digits and "_" too), even without Unicode semantics. That's easily fixed by simplifying "(?![a-zA-Z]++(?:;|\b))" to "(?![a-zA-Z])".
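To illustrate the \b problem, here is a small comparison of the two lookaheads, each with a leading "&" prepended (that framing, the sub names, and the sample strings are my own assumptions, not taken from the earlier posts). It needs Perl 5.10+ because of the possessive quantifier.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helpers: true if the string contains an "&" that the
# given lookahead classifies as NOT starting an entity.
sub is_bare_with_b  { return $_[0] =~ /&(?![a-zA-Z]++(?:;|\b))/ ? 1 : 0 }
sub is_bare_simple  { return $_[0] =~ /&(?![a-zA-Z])/           ? 1 : 0 }

for my $s ('&copy 2009', '&copy2009', '&copy_x', '& alone') {
    printf "%-12s  with \\b: %d  simplified: %d\n",
        $s, is_bare_with_b($s), is_bare_simple($s);
}
```

With \b, "&copy 2009" is treated as an entity but "&copy2009" and "&copy_x" are not, since there is no word boundary between a letter and a digit or underscore. The simplified lookahead treats all three the same way.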
Also, "#" is missing in your pattern, and you have an extra ")".
Fix:
s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
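A minimal sketch of that substitution in use, assuming the intended replacement is the escaped form "&amp;" (the thread is about escaping ampersands that aren't part of an entity). The sub name escape_bare_ampersands and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wrapper: escape every "&" that does not start a numeric
# character reference (&#123; or &#x1F;) or a letter-initial name.
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
    return $html;
}

for my $s ('a & b', '1 && 2', '&#169; &#xA9; &amp;', '&#169', 'AT&T') {
    printf "%-22s => %s\n", $s, escape_bare_ampersands($s);
}
```

Note that "AT&T" is left untouched: since the ";" is optional, any "&" followed by a letter is assumed to start an entity. A numeric reference without its ";" (such as "&#169") does get escaped.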
By the way, I used (?>) instead of the possessive quantifier since the former dates back to at least 5.6, whereas the latter was introduced in 5.10.