in reply to Re: Matching ampersands that are NOT part of an HTML entity?
in thread Matching ampersands that are NOT part of an HTML entity?

To be pedantic, don't use \d as a substitute for [0-9]. \d matches much more than Western digits: it also matches digits in many other scripts, and those aren't legal in numeric HTML entities.
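
A minimal sketch of the difference, assuming a string that holds U+0663 (ARABIC-INDIC DIGIT THREE); the two sub names are made up for illustration:

```perl
use strict;
use warnings;

# U+0663 is a decimal digit as far as Unicode is concerned,
# but it is not one of the ASCII digits 0-9.
my $arabic_three = "\x{0663}";

# Hypothetical helpers: each returns 1 if the whole string is one "digit".
sub matches_backslash_d { return $_[0] =~ /\A\d\z/     ? 1 : 0 }
sub matches_ascii_class { return $_[0] =~ /\A[0-9]\z/  ? 1 : 0 }

printf "\\d: %d, [0-9]: %d\n",
    matches_backslash_d($arabic_three),
    matches_ascii_class($arabic_three);
```

On perl 5.14 and later the /a modifier also restricts \d to ASCII, but spelling out [0-9] works on every version.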

Furthermore, the ';' is optional after a named entity, provided the entity isn't followed by other letters. So, I'd use:

/&(?![a-zA-Z]++(?:;|\b)|x[0-9a-fA-F]++;|[0-9]++;))/

Re^3: Matching ampersands that are NOT part of an HTML entity?
by ikegami (Patriarch) on Aug 07, 2008 at 12:57 UTC

    Good point about "\d". As I stated, I assumed the original pattern was correct. I did so because I didn't look up what a valid entity could be.

    As for the optional ";", the rules are hidden in the SGML spec. Perhaps it would make sense to add the ";" if it's missing (using s/(&[a-zA-Z]++)(?!;)/$1;/g;).
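
    A rough sketch of that substitution in use (the sub name is hypothetical, and the possessive "++" needs perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution quoted above.
sub add_missing_semicolons {
    my ($text) = @_;
    # Possessive ++ keeps the engine from giving back letters, so
    # "&amp;" cannot be re-matched as "&am" followed by "p".
    $text =~ s/(&[a-zA-Z]++)(?!;)/$1;/g;
    return $text;
}

print add_missing_semicolons('&copy 2008 &amp; &lt'), "\n";
print add_missing_semicolons('AT&T'), "\n";
```

    Note that the second call also appends a ";" to "AT&T", since the pattern cannot tell an entity name from any other run of letters after "&".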

    Going by your description of what is valid, using \b is incorrect. \w matches more than letters (digits and "_" as well), even without Unicode semantics, so \b does not mean "not followed by a letter". That's easily fixed by simplifying "(?![a-zA-Z]++(?:;|\b))" to "(?![a-zA-Z])".
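
    To see where the two lookaheads disagree, here is a small sketch (variable and sub names are made up; the first pattern needs perl 5.10 or later for "++"):

```perl
use strict;
use warnings;

# Lookahead using \b, as in the parent post (minus the stray paren).
my $with_b = qr/&(?![a-zA-Z]++(?:;|\b))/;
# Simplified lookahead: any "&" followed by a letter counts as an entity.
my $simple = qr/&(?![a-zA-Z])/;

# Hypothetical helper: 1 if the pattern finds a "bare" ampersand.
sub finds_bare_amp {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $sample ('&amp;', '&amp x', '&amp1', '&amp_x', '& x') {
    printf "%-8s with \\b: %d  simplified: %d\n",
        $sample,
        finds_bare_amp($with_b, $sample),
        finds_bare_amp($simple, $sample);
}
```

    The two patterns differ on "&amp1" and "&amp_x": there is no word boundary between "p" and a digit or underscore, so the \b version reports the "&" as bare even though the name is not followed by a letter.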

    Also, "#" is missing in your pattern, and you have an extra ")".

    Fix:

    s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
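
    A quick way to try the pattern on a few inputs, as a sketch only: the sub name is hypothetical, and "&amp;" is assumed to be the intended replacement text.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every "&" that does not start a numeric
# reference (&#123; or &#x1F;) and is not followed by a letter.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
    return $text;
}

for my $sample ('a & b', '&#169; &#xA9; &copy;', '&#12', '&#x;', 'AT&T') {
    printf "%-22s => %s\n", $sample, escape_bare_ampersands($sample);
}
```

    Well-formed references are left alone, while "&#12" and "&#x;" are escaped because the numeric form is incomplete. "AT&T" is also left alone: with the simplified lookahead, any "&" followed by a letter is treated as the start of a named entity.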

    By the way, I used (?>) instead of the possessive quantifier since the former dates back to at least 5.6, whereas the latter was introduced in 5.10.