Re^2: Matching ampersands that are NOT part of an HTML entity?

To be pedantic, don't use \d as a substitute for [0-9]. \d matches far more than just Western digits: it also matches digits in many other scripts, and those aren't legal in numeric HTML entities.
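To make the difference concrete, here is a minimal sketch comparing the two character classes on one ASCII digit and two non-ASCII digits. The helper names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 if the single character counts
# as a digit for the given class, 0 otherwise.
sub is_digit_backslash_d {
    my ($ch) = @_;
    return $ch =~ /\A\d\z/ ? 1 : 0;
}

sub is_digit_ascii {
    my ($ch) = @_;
    return $ch =~ /\A[0-9]\z/ ? 1 : 0;
}

# "3", U+0663 ARABIC-INDIC DIGIT THREE, U+0969 DEVANAGARI DIGIT THREE
for my $ch ("3", "\x{0663}", "\x{0969}") {
    printf "U+%04X  \\d=%d  [0-9]=%d\n",
        ord($ch), is_digit_backslash_d($ch), is_digit_ascii($ch);
}
```

Only the first character should satisfy both classes; the other two satisfy \d alone.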

Furthermore, the ';' is optional when a named entity is used and isn't followed by other letters. So, I'd use:

  /&(?![a-zA-Z]++(?:;|\b)|x[0-9a-fA-F]++;|[0-9]++;))/

Replies are listed 'Best First'.

Re^3: Matching ampersands that are NOT part of an HTML entity?
by ikegami (Patriarch) on Aug 07, 2008 at 12:57 UTC

Good point about "\d". As I stated, I assumed the original pattern was correct. I did so because I didn't look up what a valid entity could be.

As for the optional ";", the rules are hidden in the SGML spec. Perhaps it would make sense to add the ";" if it's missing (using s/(&[a-zA-Z]++)(?!;)/$1;/g;).
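Here is a rough sketch of that substitution wrapped in a sub (the name add_missing_semicolon is made up for illustration), mostly to show what the possessive quantifier buys you. It needs 5.10 or later because of the "++".

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers need Perl 5.10+

# Hypothetical wrapper around the substitution suggested above.
sub add_missing_semicolon {
    my ($text) = @_;
    # "++" stops the engine from giving back letters, so "&amp;" can't be
    # re-matched as "&am" followed by a non-";" character.
    $text =~ s/(&[a-zA-Z]++)(?!;)/$1;/g;
    return $text;
}

print add_missing_semicolon('&copy 2008'), "\n";
print add_missing_semicolon('&amp;'), "\n";
print add_missing_semicolon('AT&T'), "\n";
```

Note the last case: any "&" followed by letters is treated as an entity name, so "AT&T" also gets a ";" appended. Whether that's acceptable depends on the input.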

Going by your description of what is valid, using \b is incorrect. \b is defined in terms of \w, and \w matches more than letters (digits and "_" too), even without Unicode semantics. That's easily fixed by simplifying "(?![a-zA-Z]++(?:;|\b))" to "(?![a-zA-Z])".
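A small sketch of the \b problem, using a made-up helper that applies just the named-entity part of the pattern to the start of a string:

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers need Perl 5.10+

# Hypothetical helper: does the text start with something the
# \b-based sub-pattern accepts as a named entity?
sub accepted_with_b {
    my ($text) = @_;
    return $text =~ /\A&[a-zA-Z]++(?:;|\b)/ ? 1 : 0;
}

for my $text ('&amp;', '&amp rest', '&amp1', '&amp_x') {
    printf "%-10s %d\n", $text, accepted_with_b($text);
}
```

The last two are rejected because a digit or "_" after the name is still a word character, so there is no word boundary, even though neither is a letter.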

Also, "#" is missing in your pattern, and you have an extra ")".

Fix:

s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;

By the way, I used (?>) instead of the possessive quantifier since the former dates back to at least 5.6, whereas the latter was introduced in 5.10.
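For anyone who wants to sanity-check the fix, here is a minimal sketch that wraps it in a sub (escape_bare_amps is a name made up for this example) and runs it over a few inputs:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the fix above.
sub escape_bare_amps {
    my ($text) = @_;
    # Leave "&" alone when it starts a complete numeric entity
    # (&#123; or &#x1F;) or is followed by a letter; escape it otherwise.
    $text =~ s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;
    return $text;
}

for my $text ('a & b', '&amp;', '&#38;', '&#x26;', '&#38', 'AT&T') {
    printf "%-8s => %s\n", $text, escape_bare_amps($text);
}
```

Only 'a & b' and '&#38' (no trailing ";") should change. 'AT&T' is left alone, since the simplified lookahead treats any "&" followed by a letter as the start of a named entity.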
