in reply to Matching ampersands that are NOT part of an HTML entity?
An approach that has served me well (perfectly, so far) is to decode everything to UTF-8 characters first and then re-encode it as entities. It works because decoding a plain & is a no-op, so you're essentially normalizing the text and then encoding it. You will lose your original entities, but the new ones will probably be better because they will be uniform. I recommend numeric entities; the example shows named ones.
    use strict;
    use warnings;
    use HTML::Entities;
    use Encode;

    for my $line ( <DATA> ) {
        # This is a no-op on plain &s
        my $utf8 = HTML::Entities::decode($line);
        print Encode::encode_utf8($utf8);

        my $proper = HTML::Entities::encode($utf8);
        # OR encode_numeric()
        print $proper;
    }

    __DATA__
    Purus Accumsan Felis ‰ Maecenas Nibh θ Eget Phasellus & Mi + Amet.
    Odio Amet && Purus. Mi Ullamcorper Lorem Eget Nibh.
    http://www.example.com/?name=John&residence=Vatican+City&job=Pope
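Since I recommend numeric entities but only showed named ones, here is a minimal sketch of the numeric variant. The helper name normalize_entities and the sample strings are mine, made up for illustration; decode_entities and encode_entities_numeric are the documented HTML::Entities functions (the latter has to be imported explicitly). Treat it as a sketch of the decode-then-encode idea, not a drop-in fix for every input.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities_numeric);

# Hypothetical helper (not part of HTML::Entities): decode whatever entities
# are already present, then re-encode all unsafe characters as numeric entities.
sub normalize_entities {
    my ($text) = @_;

    # In scalar context this returns a decoded copy; a bare "&" that is not
    # part of an entity is left untouched.
    my $chars = decode_entities($text);

    # Every remaining &, <, >, quote, and non-ASCII character becomes a
    # hexadecimal numeric entity, so the result is uniform.
    return encode_entities_numeric($chars);
}

print normalize_entities('Eget &amp; Phasellus & Mi &theta;'), "\n";
print normalize_entities('http://www.example.com/?name=John&residence=Vatican+City'), "\n";
```

Because the first step undoes any existing encoding, running the helper twice on the same text should give the same result as running it once, which is what makes it safe on input that is only partly encoded.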
Re^2: Matching ampersands that are NOT part of an HTML entity?
by EvanK (Chaplain) on Aug 07, 2008 at 15:12 UTC
by Your Mother (Archbishop) on Aug 07, 2008 at 15:18 UTC