Re: Matching ampersands that are NOT part of an HTML entity?

An approach that has served me well (perfectly so far) is to decode everything to Unicode characters first (the $utf8 variable below) and then re-encode it to entities. It works because decoding a plain & is a no-op. So you're essentially normalizing the text and then encoding it. You will lose your original entities, but the new ones will probably be better because they will be uniform. I recommend numeric entities; the example shows named ones.

use strict;
use warnings;

use HTML::Entities;
use Encode;

while ( my $line = <DATA> )
{
    # This is a no-op on plain &s
    my $utf8 = HTML::Entities::decode($line);
    print Encode::encode_utf8($utf8);
    my $proper = HTML::Entities::encode($utf8); # OR encode_numeric()
    print $proper;
}

__DATA__
Purus Accumsan Felis &#8240; Maecenas Nibh &theta; Eget Phasellus & Mi Amet.  Odio Amet && Purus.  Mi Ullamcorper Lorem Eget Nibh.

http://www.example.com/?name=John&residence=Vatican+City&job=Pope
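For the numeric variant recommended above, a minimal sketch might look like the following. It assumes HTML::Entities is installed and uses its documented `decode_entities` and `encode_entities_numeric` functions; `normalize_entities` is a hypothetical helper name and the sample string is made up for illustration.

```perl
use strict;
use warnings;

use HTML::Entities qw( decode_entities encode_entities_numeric );

# Hypothetical helper (not from the post): decode every entity to its
# character, then re-encode the unsafe characters as numeric entities.
sub normalize_entities {
    my ($text) = @_;

    # In non-void context decode_entities returns a decoded copy.
    # A bare "&" that does not start an entity is left alone.
    my $chars = decode_entities($text);

    # Default unsafe set: control chars, high-bit chars, < & > ' "
    return encode_entities_numeric($chars);
}

my $sample = 'Nibh &theta; Eget & Mi &#8240; && Purus';
print normalize_entities($sample), "\n";
# Prints: Nibh &#x3B8; Eget &#x26; Mi &#x2030; &#x26;&#x26; Purus
```

Every ampersand comes out as `&#x26;` whether or not it was already part of an entity, which is the uniformity the approach is after.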


Re^2: Matching ampersands that are NOT part of an HTML entity? by EvanK (Chaplain) on Aug 07, 2008 at 15:12 UTC
I may be missing something, but it looks like you're printing and then discarding the utf8-encoded text, then continuing on with the non-utf8 text. Shouldn't it be something like this? `my $utf8 = HTML::Entities::decode($line); $utf8 = Encode::encode_utf8($utf8); my $proper = HTML::Entities::encode($utf8); print $proper;`
Update: Ah, never mind, I misunderstood what you were saying initially.
__________ Systems development is like banging your head against a wall... It's usually very painful, but if you're persistent, you'll get through it.
Re^3: Matching ampersands that are NOT part of an HTML entity? by Your Mother (Archbishop) on Aug 07, 2008 at 15:18 UTC
Uh... no. Did you run it? The `print Encode::encode_utf8($utf8);` is just there to see the intermediary step. `Encode::encode_utf8` makes the output "safe" for the terminal: no "wide character" warnings.
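To make the character-versus-byte distinction in this exchange concrete, here is a small sketch. It assumes only the core Encode module; the variable names are illustrative and not taken from the posts above.

```perl
use strict;
use warnings;

use Encode qw( encode_utf8 );

my $chars = "\x{2030}";            # one character: PER MILLE SIGN
my $bytes = encode_utf8($chars);   # the same character as UTF-8 bytes

printf "%d character, %d bytes\n", length($chars), length($bytes);
# Prints: 1 character, 3 bytes

# Printing $chars to a handle with no encoding layer is what triggers
# the "Wide character" warning; printing the encoded bytes does not.
print $bytes, "\n";
```

Setting an encoding layer once with `binmode(STDOUT, ':encoding(UTF-8)')` and printing the character string directly is another common way to avoid the warning.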