Your task seems more like "replace a wide character whenever there is an obvious ASCII substitute", which is much simpler; this could apply to various quotes, brackets and other punctuation as well as spaces and hyphens/dashes. (The use of wide-character "smart quotes" seems to be on the rise.)
If you have wide-character spaces in a utf8 string that was decoded from HTML, turning them all into ASCII spaces is easy:

    s/\s/ /g;

(Of course, that will apply to newlines and tabs as well, but with HTML data, this isn't likely to be a problem.)
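If the newlines do matter, one possible variation is to match horizontal whitespace only. The following is a minimal sketch, assuming Perl 5.10 or later (for `\h`) and a string that has already been decoded to characters; the sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: a decoded (character) string containing a no-break
# space, an em space and an ideographic space, plus a real newline.
my $text = "one\x{A0}two\x{2003}three\x{3000}four\nfive";

# \s matches every whitespace character, so the newline is flattened too.
( my $all = $text ) =~ s/\s/ /g;

# \h matches horizontal whitespace only, so the newline survives.
( my $horiz = $text ) =~ s/\h/ /g;

print "all:   $all\n";
print "horiz: $horiz\n";
```

Both results contain only ASCII, so they can be printed without an output encoding layer.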
As for the various punctuation marks, if you already know which wide characters to expect, just put those into a regex character class:

    # soft hyphen, hyphens and dashes, small and fullwidth hyphen-minus
    my $dashes = join '', map { chr() } ( 0xAD, 0x2010 .. 0x2015, 0xFE63, 0xFF0D );
    # modifier letter apostrophe, single quotation marks
    my $squots = join '', map { chr() } ( 0x02BC, 0x2018 .. 0x201B );
    # modifier letter double apostrophe, double quotation marks
    my $dquots = join '', map { chr() } ( 0x02EE, 0x201C .. 0x201F );
    s/[$dashes]/-/g;
    s/[$squots]/'/g;
    s/[$dquots]/"/g;

If you run across any wide characters besides those, you can look them up pretty easily and add to your character classes as needed. Here's a simple script for getting the names of various codepoints, codepoints that match various names, etc.
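For a quick look at what is actually in your data, something along these lines may be enough. This is not the script mentioned above, just a minimal sketch using the core `charnames` module, assuming Perl 5.10 or later; `describe_non_ascii` is a hypothetical helper name and the sample string is invented.

```perl
use strict;
use warnings;
use charnames ();    # core module; viacode() maps a code point to its name

# Return one "U+XXXX NAME" line per distinct non-ASCII character in $text,
# in order of first appearance. (Hypothetical helper, for illustration only.)
sub describe_non_ascii {
    my ($text) = @_;
    my %seen;
    my @report;
    for my $char ( $text =~ /([^\x00-\x7F])/g ) {
        next if $seen{$char}++;
        my $code = ord $char;
        my $name = charnames::viacode($code) // '(no name)';
        push @report, sprintf 'U+%04X %s', $code, $name;
    }
    return @report;
}

# Invented sample: smart double quotes, an en dash and a smart apostrophe.
my $sample = "\x{201C}quoted\x{201D} \x{2013} it\x{2019}s";
print "$_\n" for describe_non_ascii($sample);
```

Any code point that shows up in the report and has an obvious ASCII equivalent can then be added to the matching character class.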
(update: fixed a typo in the assignment to "$dashes")
In reply to Re: unicode normalization
by graff
in thread unicode normalization
by mscudder