in reply to unicode normalization
I think I would approach this from the other direction, i.e. convert the HTML entities to a suitable equivalent.
Tables of entities can be found at the W3C. There are around 200 in total, but you are probably only going to be interested in a dozen or so.
You could build a hash and then replace the entities:
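    my %lookup = (
        '2019' => "'",   # replace &rsquo; (U+2019) with an apostrophe
        '2010' => '-',   # hyphen (U+2010)
        # etc.
    );
    # assumes the entities appear as hex references such as &#x2019;
    # anything not in the table is left untouched
    $text =~ s/&#x([0-9A-Fa-f]+);/exists $lookup{uc $1} ? $lookup{uc $1} : $&/eg;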
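If the text mixes named, decimal and hex forms of the same entity, the same idea can be stretched to cover all three. This is only a rough sketch: the sub name replace_entities, the two tables and their contents are my own illustration, and you would fill in whichever dozen or so entities you actually care about.

```perl
use strict;
use warnings;

# Hypothetical table: code point => plain ASCII stand-in.
# 0x2019 is a numeric literal, so the key is stored as "8217".
my %by_codepoint = (
    0x2018 => "'",    # left single quotation mark
    0x2019 => "'",    # right single quotation mark
    0x201C => '"',    # left double quotation mark
    0x201D => '"',    # right double quotation mark
    0x2010 => '-',    # hyphen
    0x2013 => '-',    # en dash
);

# Hypothetical table: entity name => code point.
my %name_to_cp = (
    lsquo => 0x2018, rsquo => 0x2019,
    ldquo => 0x201C, rdquo => 0x201D,
    ndash => 0x2013,
);

sub replace_entities {
    my ($text) = @_;
    # Match &#xHEX; or &#DEC; or &name; and work out the code point.
    $text =~ s{&(?:#x([0-9A-Fa-f]+)|#(\d+)|(\w+));}{
        my $cp = defined $1 ? hex $1
               : defined $2 ? $2 + 0
               :              $name_to_cp{$3};
        # Entities we have no stand-in for are left exactly as found.
        defined $cp && exists $by_codepoint{$cp} ? $by_codepoint{$cp} : $&;
    }gex;
    return $text;
}

print replace_entities('It&rsquo;s a well&#x2010;known &#8220;fix&#8221; &amp; more'), "\n";
# prints: It's a well-known "fix" &amp; more
```

Keying the table on the code point rather than on the entity text means each character only has to be listed once, however it was written in the HTML.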
See The Björk Situation for a similar discussion on accents. My attempt (similar to the above) is much improved on by thundergnat, and there is a useful discussion with rhesa on the perils of 'normalisation' (you are losing detail).
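To make the 'losing detail' point concrete, here is a small sketch of one common way of stripping accents: decompose with the core Unicode::Normalize module, then delete the combining marks. It is not necessarily what anyone in that thread used, and the sub name strip_accents is just my own label.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

sub strip_accents {
    my ($s) = @_;
    my $decomposed = NFD($s);       # o-umlaut becomes "o" + U+0308
    $decomposed =~ s/\p{Mn}+//g;    # drop the combining (nonspacing) marks
    return $decomposed;
}

my $name = "Bj\x{F6}rk";            # Bjork with o-umlaut, written as an escape
print strip_accents($name), "\n";   # prints: Bjork

# The peril: two different words are now indistinguishable.
my $same = strip_accents("r\x{E9}sum\x{E9}") eq strip_accents("resume");
print $same ? "collapsed\n" : "distinct\n";   # prints: collapsed
```

Once the marks are gone there is no way to get them back, so it is worth keeping the original text around if you might need it later.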
Hope this helps