Re: Removing certain non-word characters

The approaches that enumerate a-z and A-Z don't play nice with locales. For example, in the Portuguese character set, you will have vowels with ~, ', `, and ^ over them as part of the alphabet, and they don't fall within the range of a-z.

\w is locale-aware (when use locale; is in effect), but has the unfortunate disadvantage of also matching '_' (underscore). So if you were to use \w, you would need s/// to eliminate all \W characters except hyphen, whitespace, and tick, and also eliminate underscore. That can get a little convoluted.
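For what it's worth, a single-regex \w version might look something like the sketch below. The sample string is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input, for illustration only.
my $string = "it's a well-known_fact (really)!";

# Remove anything that is not a word character, whitespace, tick, or hyphen,
# and separately remove underscore, which \w would otherwise let through.
$string =~ s/[^\w\s'-]|_//g;

print "$string\n";   # prints: it's a well-knownfact really
```

The alternation is what makes this awkward: the underscore needs its own branch because a negated class cannot say "everything in \w except '_'" directly.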

The easiest solution might be to use a couple of regexes instead of just one. Another solution might be to match what you want and leave out the rest. A solution that I considered (and Zaxo also mentioned in the CB) is to use the oft-neglected POSIX character classes:

$string =~ s/[^[:alnum:]\s'-]//g;

Which says, "Substitute anything that is not alphanumeric, whitespace, tick, or hyphen with nothing (i.e., just get rid of it)." Note that \s covers tabs and newlines as well as plain spaces.
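For comparison, the other two approaches mentioned earlier (a couple of regexes, or matching what you want and leaving out the rest) could be sketched roughly as follows. The variable names and the sample string are my own assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, for illustration only.
my $input = "rock_n'_roll: well-known (and loud)!";

# Approach 1: two substitutions instead of one.
# First drop everything outside \w, whitespace, tick, and hyphen...
(my $two_pass = $input) =~ s/[^\w\s'-]//g;
# ...then drop the underscores that \w let through.
$two_pass =~ s/_//g;

# Approach 2: match the runs of characters to keep and join them back together.
my $kept = join '', $input =~ /[[:alnum:]\s'-]+/g;

print "$two_pass\n$kept\n";
```

Both lines print rockn'roll well-known and loud for this input. The second approach leaves the original string untouched, which may or may not be what you want.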

The POSIX classes respect locales, so if your code ever ends up running in an environment where use locale; is in effect, it shouldn't break.

Dave
