in reply to Matching ampersands that are NOT part of an HTML entity?

Going on the assumption that your pattern is correct,
s/ & (?! (?: \# (?: x[\da-f]+ | \d+ ) | [a-z]+ ) ; ) /&amp;/xi

I factored out the "#" and removed extraneous captures and groupings, but the key is the negative lookahead, (?!...).
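For illustration, here is a minimal sketch of that substitution run over a sample string. The sample text, the variable names, the added /g flag and the "&amp;" replacement are my assumptions, not something the original question confirms.

```perl
use strict;
use warnings;

# Hypothetical input: bare ampersands mixed with named and numeric entities.
my $text = 'Fish & chips &amp; peas &#233; &#xE9; Q&A';

# Escape "&" unless it starts a numeric reference or a name ending in ";".
# The "#" must be escaped because /x would otherwise treat it as a comment.
(my $escaped = $text) =~ s/ & (?! (?: \# (?: x[\da-f]+ | \d+ ) | [a-z]+ ) ; ) /&amp;/xgi;

print "$escaped\n";
```

The "&" in "Q&A" is escaped too, because the letters after it are not followed by ";".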

Update: And if you wanted to only accept known entities,

local our %known = map { $_ => 1 } qw( eacute Eacute ecirc Ecirc ... );
s/ & (?! (?: \# (?: x[\da-f]+ | \d+ ) | ([a-z]+) (?(?{ !$known{$1} }) (?!) ) ) ; ) /&amp;/xi

or

use Regexp::List qw( );
my @known = qw( eacute Eacute ecirc Ecirc ... );
my $known = Regexp::List->new()->list2re(@known);
s/ & (?! (?: \# (?: x[\da-f]+ | \d+ ) | $known ) ; ) /&amp;/xi
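If installing a module is not an option, the same idea can be sketched with a plain alternation built by join. This is only an illustration: the entity list is a small hypothetical subset, the names $names and $bare_amp are mine, and I dropped /i so that "eacute" and "Eacute" stay distinct.

```perl
use strict;
use warnings;

# Illustrative subset only; a real list would contain every entity you accept.
my @known = qw( amp lt gt quot eacute Eacute );

# Build "amp|lt|gt|..." with each name escaped for safe interpolation.
my $names = join '|', map { quotemeta($_) } @known;

# A "&" that does NOT start a numeric reference or a known name ending in ";".
my $bare_amp = qr/ & (?! (?: \# (?: [xX][0-9a-fA-F]+ | [0-9]+ ) | (?:$names) ) ; ) /x;

my $text = 'caf&eacute; & cr&egrave;me &#233;';
(my $escaped = $text) =~ s/$bare_amp/&amp;/g;

print "$escaped\n";
```

Here "&egrave;" gets escaped only because "egrave" is missing from the sample list.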

Update: Escaped "#" as per reply.

Re^2: Matching ampersands that are NOT part of an HTML entity?
by AnomalousMonk (Archbishop) on Aug 07, 2008 at 00:19 UTC
    Shouldn't that be \# or [#] in an extended (i.e., /x) regex? Otherwise a naked # begins a comment that runs to the end of the line.
Re^2: Matching ampersands that are NOT part of an HTML entity?
by JavaFan (Canon) on Aug 07, 2008 at 12:02 UTC
    To be pedantic, don't use \d as a substitute for [0-9]. \d matches a lot more than just Western digits: it also matches digits in many other scripts, and those aren't legal in numeric HTML entities.
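A quick way to see the difference, as a small sketch (the test character and variable names are just my choice of example):

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit, but not an ASCII one.
my $digit = "\x{0663}";

my $matches_d     = $digit =~ /^\d$/    ? 1 : 0;   # \d follows Unicode rules here
my $matches_ascii = $digit =~ /^[0-9]$/ ? 1 : 0;   # explicit ASCII range

print "\\d: $matches_d, [0-9]: $matches_ascii\n";
```

On Perl 5.14 and later, the /a modifier is another way to restrict \d to ASCII digits.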

    Furthermore, the ';' is optional if a named entity is used and it isn't followed by other letters. So, I'd use:

    /&(?![a-zA-Z]++(?:;|\b)|x[0-9a-fA-F]++;|[0-9]++;))/

      Good point about "\d". As I stated, I assumed the original pattern was correct. I did so because I didn't look up what a valid entity could be.

      As for the optional ";", the rules are hidden in the SGML spec. Perhaps it would make sense to add the ";" if it's missing (using s/(&[a-zA-Z]++)(?!;)/$1;/g;).
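As a rough sketch of what that clean-up pass does (the sample input and variable names are hypothetical, and the possessive quantifier needs Perl 5.10 or later):

```perl
use strict;
use warnings;

# Hypothetical input: two names missing their ";" and one complete entity.
my $html = 'x &amp y &lt; z &copy2008';

# The possessive ++ keeps the name from backtracking, so "&lt;" is left alone
# instead of being split into "&l;t;".
(my $fixed = $html) =~ s/(&[a-zA-Z]++)(?!;)/$1;/g;

print "$fixed\n";
```

Note that this treats every "&" followed by letters as an entity, so "AT&T" would become "AT&T;".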

      Going by your description of what is valid, using \b is incorrect: \b is defined in terms of \w, and \w matches more than letters (digits and "_" as well), even without Unicode semantics. That's easily fixed by simplifying "(?![a-zA-Z]++(?:;|\b))" to "(?![a-zA-Z])".
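To make the \b problem concrete, here is a small comparison of the two lookaheads on a name that is followed by digits. The input string and variable names are my own illustration.

```perl
use strict;
use warnings;

my $with_b = qr/&(?![a-zA-Z]++(?:;|\b))/;   # \b variant (possessive ++ needs 5.10+)
my $letter = qr/&(?![a-zA-Z])/;             # simplified variant

# Hypothetical input: a name with no ";" followed by digits, not letters.
my $text = '&copy2008';

# 1 means "this & is NOT part of an entity" according to the pattern.
my $b_says_bare      = $text =~ $with_b ? 1 : 0;
my $letter_says_bare = $text =~ $letter ? 1 : 0;

print "with \\b: $b_says_bare, letter only: $letter_says_bare\n";
```

There is no word boundary between "y" and "2", so the \b variant reports the "&" as bare even though the name is not followed by a letter.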

      Also, "#" is missing in your pattern, and you have an extra ")".

      Fix:

      s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;

      By the way, I used (?>) instead of the possessive quantifier since the former dates back to at least 5.6, whereas the latter was introduced in 5.10.
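For anyone who wants to try it, a minimal sketch of the fix on a sample string (the sample text and variable names are mine, and "&amp;" as the replacement is an assumption about the intended output):

```perl
use strict;
use warnings;

# Hypothetical input covering letters, valid and invalid numeric references,
# and a plain bare ampersand.
my $text = 'AT&T, R&D &amp; &#169; &#xA9; &#xyz; 1 & 2';

# Leave "&" alone when it starts a well-formed numeric reference or is
# followed by a letter; escape it everywhere else.
(my $fixed = $text) =~ s/&(?!\#(?>x[0-9a-fA-F]+|[0-9]+);|[a-zA-Z])/&amp;/g;

print "$fixed\n";
```

The trade-off is visible in "AT&T" and "R&D": any "&" followed by a letter is assumed to start an entity and is left untouched.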