The source of the data is a large number of RSS feeds, which in turn point to an even larger number of individual web pages. The latter are what get harvested and processed with a few scripts. So normalizing the data at the source is not an option, since few webmasters even publish e-mail addresses, let alone fix their sites.
Maybe there is a CPAN module or a simple method to forcibly convert the incoming (or outgoing) data to UTF-8? Just labeling it UTF-8 fails, too: binmode(STDOUT, ":encoding(utf8)"); Is there a way to find out whether it should be labeled UTF-16 instead, and if so, how do I force that mode?
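A hedged sketch of one way to approach this with the core Encode distribution: guess_encoding() from Encode::Guess tries a short list of suspect encodings (it also checks for a UTF-16/32 BOM) and returns an encoding object you can decode with. The bytes_to_text() helper name and the suspect list are mine, not from the thread:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(FB_CROAK);
    use Encode::Guess;   # part of the core Encode distribution

    # Strict UTF-8 output layer; ":encoding(UTF-8)" validates the
    # output, unlike the lax ":encoding(utf8)".
    binmode STDOUT, ':encoding(UTF-8)';

    # Hypothetical helper: decode one harvested page (raw octets)
    # into a Perl character string, guessing among likely encodings.
    # ASCII and UTF-8 are checked by default; add the UTF-16 variants.
    sub bytes_to_text {
        my ($octets) = @_;
        my $enc = guess_encoding($octets, qw(UTF-16BE UTF-16LE));
        die "encoding guess failed: $enc\n" unless ref $enc;  # failure returns a string
        return $enc->decode($octets, FB_CROAK);               # die on malformed bytes
    }

Note that guess_encoding() is heuristic: with both UTF-8 and UTF-16 as suspects it can come back ambiguous, and for harvested web pages the charset declared in the HTTP Content-Type header (or the HTML meta tag) is usually more trustworthy when it is present at all.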
$ apt-cache policy perl | head -n 3
perl:
  Installed: 5.28.1-6
  Candidate: 5.28.1-6
In reply to Re^2: Safely removing Unicode zero-width spaces and other non-printing characters by mldvx4
in thread Safely removing Unicode zero-width spaces and other non-printing characters by mldvx4