Dearest Monks,

My application parses HTML, taking care to decode HTML entities with HTML::Entities::decode_entities(). However, this often leaves me with 'wide' characters, i.e. code points above 0xFF.

Unicode specifies typographically distinct space characters:

U+2000 en quad
U+2001 em quad
U+2002 en space
U+2003 em space
U+2004 three-per-em space
U+2005 four-per-em space
etc.

and dash characters:

U+2010 hyphen
U+2011 non-breaking hyphen
U+2012 figure dash
U+2013 en dash
U+2014 em dash
etc.

Same for apostrophes, quotation marks, dash bullets, and others.

Many of these characters appear in the HTML my application processes, with the result that I'm getting 'Wide character' warnings and terminations ("Wide character in subroutine entry").
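For context, here is a minimal sketch of how I understand the warning to arise, using only core Perl. The sample string and the in-memory filehandles are assumptions for illustration, not my application's actual code.

```perl
use strict;
use warnings;

# Assumed sample: "a", U+2014 EM DASH, "b" (one code point above 0xFF).
my $wide_text = "a\x{2014}b";

# Collect warnings instead of letting them reach STDERR.
my @seen_warnings;
local $SIG{__WARN__} = sub { push @seen_warnings, @_ };

# 1. A handle without an encoding layer: Perl warns "Wide character in print".
open my $raw_fh, '>', \my $raw_bytes or die "Cannot open in-memory file: $!";
print {$raw_fh} $wide_text;
close $raw_fh;

# 2. A handle with an explicit UTF-8 layer: the text is encoded on output.
open my $utf8_fh, '>:encoding(UTF-8)', \my $utf8_bytes or die "Cannot open in-memory file: $!";
print {$utf8_fh} $wide_text;
close $utf8_fh;

print "warnings seen: ", scalar(@seen_warnings), "\n";
print "bytes written through the UTF-8 layer: ", length($utf8_bytes), "\n";
```

An encoding layer would silence the warning, but it would still leave the typographic characters in my files, which is why I am asking about ASCII replacements instead.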

Since my application is not rendering text, but only storing it in plaintext files, I have no need of these typographic variants and am perfectly content to use the basic ASCII-compatible equivalents, e.g., 0x20 for spaces, 0x2D for hyphens, and so on.

I'd therefore like to replace characters greater than 0xff with their ASCII equivalents. I could construct a table or regex for this purpose, but before doing so, I thought I'd ask whether there's an existing module I could use.
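The table approach I have in mind would look roughly like this. It is only a sketch: fold_typography is a hypothetical name, and the ranges are my own illustrative selection, far from a complete table.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a few typographic characters to ASCII.
# The ranges are an illustrative selection, not a complete table.
sub fold_typography {
    my ($text) = @_;
    $text =~ tr/\x{2000}-\x{200A}/ /;     # en quad .. hair space     -> space
    $text =~ tr/\x{2010}-\x{2015}/\-/;    # hyphen .. horizontal bar  -> hyphen-minus
    $text =~ tr/\x{2018}\x{2019}/'/;      # single quotation marks    -> apostrophe
    $text =~ tr/\x{201C}\x{201D}/"/;      # double quotation marks    -> quotation mark
    return $text;
}

my $folded = fold_typography("em\x{2014}dash, en\x{2003}space, \x{201C}quoted\x{201D}");
print "$folded\n";    # prints: em-dash, en space, "quoted"
```

Maintaining such ranges by hand is exactly the chore I would like to avoid if a module already covers it.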

In particular, will normalizing text to Unicode Normalization Form KD with Unicode::Normalize do the job?
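To make the question concrete, here is a minimal sketch of the check I have in mind, using the core Unicode::Normalize module. The helper name nfkd_is_ascii is hypothetical. As I read the Unicode character data, the spaces listed above carry compatibility decompositions to U+0020, whereas the en dash and em dash have no decomposition at all, so I suspect NFKD covers only part of what I need.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: true if NFKD alone reduces the string to plain ASCII.
sub nfkd_is_ascii {
    my ($text) = @_;
    return NFKD($text) =~ /\A[\x00-\x7F]*\z/ ? 1 : 0;
}

# Sample characters taken from the lists above.
my @samples = (
    [ "\x{2001}", 'em quad' ],
    [ "\x{2003}", 'em space' ],
    [ "\x{2011}", 'non-breaking hyphen' ],
    [ "\x{2013}", 'en dash' ],
    [ "\x{2014}", 'em dash' ],
);

for my $sample (@samples) {
    my ($char, $name) = @$sample;
    printf "U+%04X %-20s %s\n", ord($char), $name,
        nfkd_is_ascii($char) ? 'ASCII after NFKD' : 'still non-ASCII after NFKD';
}
```

If that reading is right, NFKD would handle the spaces but would still need a small table for the dashes and quotation marks.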

I'd appreciate your suggestions and advice.

Thank you & regards,
Michael
----------
mscudder@earthlink.net

In reply to unicode normalization by mscudder
