Dearest Monks,
My application parses HTML and decodes HTML entities with HTML::Entities::decode_entities(). However, this often leaves me with 'wide' characters.
Unicode specifies typographically distinct space characters:
U+2000 en quad
U+2001 em quad
U+2002 en space
U+2003 em space
U+2004 three-per-em space
U+2005 four-per-em space
etc.
and dash characters:
U+2010 hyphen
U+2011 non-breaking hyphen
U+2012 figure dash
U+2013 en dash
U+2014 em dash
etc.
Same for apostrophes, quotation marks, dash bullets, and others.
Many of these characters appear in the HTML my application processes, with the result that I'm getting 'wide character' warnings and terminations ("wide character passed to subroutine").
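To illustrate the warning, here is a minimal sketch that reproduces it using only core modules. The temporary files merely stand in for my plaintext output files, and the sample string is made up. As far as I understand it, the warning fires when a string containing a code point above 0xFF is printed to a filehandle that has no encoding layer:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample text (made up): U+2013 EN DASH is above 0xFF, so the string is 'wide'.
my $text = "pages 10\x{2013}12";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# 1. A filehandle without an encoding layer: print warns "Wide character in print".
my ($raw_fh) = tempfile(UNLINK => 1);
print {$raw_fh} $text;
close $raw_fh;

# 2. The same text through a UTF-8 layer: no warning.
my ($utf8_fh) = tempfile(UNLINK => 1);
binmode $utf8_fh, ':encoding(UTF-8)';
print {$utf8_fh} $text;
close $utf8_fh;

print scalar(@warnings), " warning(s) captured\n";
```

Adding an encoding layer silences the warning but keeps the typographic characters in the output, which is not what I'm after here.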
Since my application is not rendering text, but only storing it in plaintext files, I have no need of these typographic variants and am perfectly content to use the basic ASCII-compatible equivalents, e.g., 0x20 for spaces, 0x2D for hyphens, and so on.
I'd therefore like to replace characters greater than 0xff with their ASCII equivalents. I could construct a table or regex for this purpose, but before doing so, I thought I'd ask whether there's an existing module I could use.
In particular, will normalizing text to Unicode Normalization Form KD with Unicode::Normalize do the job?
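Here is a rough sketch of what I have in mind, under stated assumptions. As far as I can tell from the Unicode data, NFKD folds the compatibility spaces (U+2000 through U+2005, for example) to U+0020, but the en dash, the em dash, and the curly quotation marks have no decomposition, so they pass through unchanged and would still need a small table. The helper name `to_ascii` and the particular mappings are my own illustrative choices, not taken from any module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: fold typographic variants down to plain ASCII.
sub to_ascii {
    my ($text) = @_;

    # Compatibility decomposition: em space becomes a plain space,
    # the ellipsis becomes "...", accented letters become base + mark.
    $text = NFKD($text);

    # Drop the combining marks left behind by the decomposition.
    $text =~ s/\p{Mn}+//g;

    # NFKD leaves these alone, so map them by hand (illustrative table).
    $text =~ s/[\x{2010}-\x{2015}\x{2212}]/-/g;   # hyphens, dashes, minus sign
    $text =~ s/[\x{2018}-\x{201B}]/'/g;           # single quotation marks
    $text =~ s/[\x{201C}-\x{201F}]/"/g;           # double quotation marks
    $text =~ s/\p{Zs}/ /g;                        # any remaining space separator

    # Assumption: anything still outside ASCII is replaced with '?'.
    $text =~ s/[^\x00-\x7F]/?/g;

    return $text;
}

print to_ascii("x\x{2014}y\x{2003}z"), "\n";   # prints "x-y z"
```

The final catch-all substitution is lossy by design; whether '?' or outright deletion is the better fallback depends on the data.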
I'll appreciate your suggestions and advice.
Thank you & regards,
Michael
----------
mscudder@earthlink.net
In reply to unicode normalization by mscudder