in reply to Matching ampersands that are NOT part of an HTML entity?

An approach that has served me well (perfectly so far) is to decode everything to utf8 first and then re-encode it to entities. It works because decoding a plain & is a no-op. So you're essentially normalizing the text and then encoding it. You will lose your original entities but the new ones will probably be better as they will be uniform. I recommend numeric entities; the example shows named ones.

    use strict;
    use warnings;
    use HTML::Entities;
    use Encode;

    for my $line ( <DATA> )
    {
        # This is a no-op on plain &s
        my $utf8 = HTML::Entities::decode($line);
        print Encode::encode_utf8($utf8);

        my $proper = HTML::Entities::encode($utf8); # OR encode_numeric()
        print $proper;
    }

    __DATA__
    Purus Accumsan Felis &#8240; Maecenas Nibh &theta; Eget Phasellus & Mi + Amet.
    Odio Amet && Purus. Mi Ullamcorper Lorem Eget Nibh.
    http://www.example.com/?name=John&residence=Vatican+City&job=Pope
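Since the post recommends numeric entities but only shows named ones, here is a minimal sketch of the numeric variant. The helper name normalize_entities and the sample string are made up for illustration; decode_entities and encode_entities_numeric are the exported names of the HTML::Entities functions.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities_numeric);

# Hypothetical helper (not from the original post): decode whatever
# entities are already present, then re-encode every unsafe character
# as a numeric entity.
sub normalize_entities {
    my ($text) = @_;

    # In non-void context decode_entities returns a decoded copy.
    # A bare & that is not part of an entity passes through unchanged.
    my $chars = decode_entities($text);

    return encode_entities_numeric($chars);
}

# Mixed input: a bare &, a named entity, and an already-encoded &amp;
print normalize_entities('Phasellus & Mi &theta; &amp; Amet'), "\n";
```

Both the bare & and the existing &amp; should come out in the same numeric form, which is the uniformity the post is after.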

Re^2: Matching ampersands that are NOT part of an HTML entity?
by EvanK (Chaplain) on Aug 07, 2008 at 15:12 UTC
    I may be missing something, but it looks like you're printing and then discarding the utf8-encoded text, then continuing on with the non-utf8 text. Shouldn't it be something like this?
        my $utf8 = HTML::Entities::decode($line);
        $utf8 = Encode::encode_utf8($utf8);
        my $proper = HTML::Entities::encode($utf8);
        print $proper;
    Update: Ah, never mind, I misunderstood what you were saying initially.

    __________
    Systems development is like banging your head against a wall...
    It's usually very painful, but if you're persistent, you'll get through it.

      Uh... no. Did you run it? The print Encode::encode_utf8($utf8); is just there to see the intermediary step. Encode::encode_utf8 makes the output "safe" for the terminal: no "wide character" warnings.
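
A small sketch of why the order matters, assuming the usual behaviour of HTML::Entities on character strings versus byte strings: if the text is turned into UTF-8 bytes before the entity encoding, each byte is treated as a separate character and gets its own entity. The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);
use Encode qw(encode_utf8);

# One character after decoding: GREEK SMALL LETTER THETA (U+03B8).
my $chars = decode_entities('&theta;');

# Encoding the character string directly yields a single entity.
my $from_chars = encode_entities($chars);

# Encoding the UTF-8 bytes instead makes encode_entities see two
# separate high-bit bytes, so it emits one entity per byte.
my $from_bytes = encode_entities( encode_utf8($chars) );

print "from characters: $from_chars\n";
print "from bytes:      $from_bytes\n";
```

The two printed lines differ, which is why encode_utf8 is only used on the copy that goes to the terminal and not on the string handed to the entity encoder.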