Re: unicode normalization

I think I would approach this from the other direction, i.e. convert the HTML entities to a suitable equivalent.

Tables of entities can be found at the W3C. HTML 4 defines around 250 in total, but you are probably only going to be interested in a dozen or so.

You could build a hash and then replace the entities:

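# keys are the decoded characters, so $text must already be a character string
my %lookup = (
 "\x{2019}" => "'", # replace &rsquo; (right single quote) with an apostrophe
 "\x{2010}" => "-", # hyphen etc.
);

# swap each character that has an entry in %lookup, leave the rest alone
$text =~ s/(.)/exists $lookup{$1} ? $lookup{$1} : $1/eg;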
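
If the text still contains the entities themselves rather than the decoded characters, the same lookup idea can be keyed on the entity name instead. The following is only a rough sketch: the helper name entities_to_ascii and the hand-picked table are my own assumptions for illustration, and unknown entities are deliberately left untouched. If you want the real characters rather than ASCII stand-ins, decode_entities from HTML::Entities on CPAN covers the full table.

```perl
use strict;
use warnings;

# Hand-picked table (an assumption for illustration):
# entity name => ASCII stand-in.
my %entity = (
    rsquo  => "'",
    lsquo  => "'",
    rdquo  => '"',
    ldquo  => '"',
    ndash  => '-',
    mdash  => '--',
    hellip => '...',
);

# Hypothetical helper: replace the named entities we know about and
# put anything else back exactly as it was found.
sub entities_to_ascii {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $text;
}

print entities_to_ascii("It&rsquo;s a well&ndash;known &amp; old trick"), "\n";
```

Keeping the table small means you decide case by case what counts as a "suitable equivalent", which is really the point of doing it this way round.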

See The Björk Situation for a similar discussion on accents. My attempt there (similar to the above) is much improved on by thundergnat, and there is a useful discussion with rhesa on the perils of 'normalisation' (you are losing detail).
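
To make the "losing detail" point concrete, here is a small sketch of one common way to strip accents: decompose with the core Unicode::Normalize module, then drop the combining marks. This is not necessarily what was proposed in that thread, and strip_accents is a hypothetical name of my own.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonically decompose the string, then remove
# the combining marks (general category Mn) that NFD split off.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# "\x{F6}" is o-with-diaeresis; after stripping it is a plain "o".
print strip_accents("Bj\x{F6}rk"), "\n";
```

Once that has run you can no longer tell whether the original had "ö" or "o", which is exactly the peril: the step cannot be undone. Letters with no canonical decomposition, such as "ø", pass through unchanged, so you would still need a lookup table for those.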

Hope this helps
