Good point about "\d". As I stated, I assumed the original pattern was correct. I did so because I didn't look up what a valid entity could be.
As for the optional ";", the rules are hidden in the SGML spec. Perhaps it would make sense to add the ";" if it's missing (using s/(&[a-zA-Z]++)(?!;)/$1;/g;).
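Here's a quick sketch of that semicolon fix-up, in case it helps. The sub name and the sample strings are made up for illustration, and the possessive quantifier means it needs Perl 5.10 or later. Note that it treats any "&" followed by letters as an entity reference, so it's only reasonable on text where a bare "&" can't be directly followed by a letter.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): appends the missing ";"
# to named entity references. Requires Perl 5.10+ because of "++".
sub add_missing_semicolons {
    my ($text) = @_;
    # "++" is possessive: once the letters are consumed, the regex cannot
    # give one back to satisfy (?!;), so "&amp;" is left untouched.
    $text =~ s/(&[a-zA-Z]++)(?!;)/$1;/g;
    return $text;
}

print add_missing_semicolons('&copy 2009 &amp; later'), "\n";  # &copy; 2009 &amp; later
print add_missing_semicolons('AT&T'), "\n";                    # AT&T;  (the caveat above)
```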
Going by your description of what is valid, using \b is incorrect. \w matches more than letters, even without unicode semantics. That's easily fixed by simplifying "(?![a-zA-Z]++(?:;|\b))" to "(?![a-zA-Z])".
Also, "#" is missing in your pattern, and you have an extra ")".
Fix:
s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
By the way, I used (?>...) instead of the possessive quantifier since the former dates back to at least Perl 5.6, whereas the latter was introduced in Perl 5.10.
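For what it's worth, here's a small harness for trying the fix above. The sub name and the sample strings are mine, and I'm assuming the replacement text is meant to be "&amp;amp;" (i.e. the literal five characters "&amp;" in the Perl source). Treat it as a sketch, not a full entity validator.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution; the name is illustrative.
sub escape_bare_ampersands {
    my ($text) = @_;
    # Leave "&" alone when it starts a complete numeric reference
    # (&#123; or &#x1F;) or is followed by a letter (possible named entity,
    # with or without the trailing ";"). Escape every other "&".
    $text =~ s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
    return $text;
}

for my $in ('a & b', '&amp;', '&copy', '&#169;', '&#xA9;', '&#169', 'AT&T') {
    printf "%-8s => %s\n", $in, escape_bare_ampersands($in);
}
# a & b    => a &amp; b
# &amp;    => &amp;
# &copy    => &copy
# &#169;   => &#169;
# &#xA9;   => &#xA9;
# &#169    => &amp;#169
# AT&T     => AT&T
```

The last case shows the trade-off of "(?![a-zA-Z])": since the ";" is optional, any "&" followed by a letter is assumed to start an entity and is left as is.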
In reply to Re^3: Matching ampersands that are NOT part of an HTML entity?
by ikegami
in thread Matching ampersands that are NOT part of an HTML entity?
by Anonymous Monk