An approach that has served me well (perfectly, so far) is to decode everything to UTF-8 text first and then re-encode it to entities. It works because decoding a plain & is a no-op, so you're essentially normalizing the text and then encoding it. You will lose your original entities, but the new ones will probably be better because they will be uniform. I recommend numeric entities; the example shows named ones.
    use strict;
    use warnings;
    use HTML::Entities;
    use Encode;

    for my $line ( <DATA> )
    {
        # This is a no-op on plain &s
        my $utf8 = HTML::Entities::decode($line);
        print Encode::encode_utf8($utf8);

        my $proper = HTML::Entities::encode($utf8); # OR encode_numeric()
        print $proper;
    }

    __DATA__
    Purus Accumsan Felis ‰ Maecenas Nibh θ Eget Phasellus & Mi + Amet. Odio Amet && Purus. Mi Ullamcorper Lorem Eget Nibh.
    http://www.example.com/?name=John&residence=Vatican+City&job=Pope
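Since I recommend numeric entities but the example shows named ones, here is a minimal sketch of the numeric variant. The helper name normalize_entities and the sample string are made up for illustration; the only library calls are HTML::Entities::decode and HTML::Entities::encode_numeric. It also checks that running the normalization a second time changes nothing, which is the property that makes the approach safe on text that mixes bare ampersands and existing entities.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper: decode whatever entities are present, then
# re-encode the result with numeric entities only.
sub normalize_entities {
    my ($text) = @_;
    my $chars = HTML::Entities::decode($text);       # bare & is left alone
    return HTML::Entities::encode_numeric($chars);   # uniform numeric entities
}

# Made-up sample mixing an entity-encoded & with a bare one.
my $mixed = 'Fish &amp; Chips & Peas &lt;b&gt;';

my $once  = normalize_entities($mixed);
my $twice = normalize_entities($once);

print "$once\n";
print $once eq $twice ? "stable\n" : "changed\n";
```

If the second pass ever printed "changed", something would be double-encoding; with decode-then-encode it should not.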
In reply to Re: Matching ampersands that are NOT part of an HTML entity?
by Your Mother
in thread Matching ampersands that are NOT part of an HTML entity?
by Anonymous Monk