in reply to Re: Safely removing Unicode zero-width spaces and other non-printing characters
in thread Safely removing Unicode zero-width spaces and other non-printing characters
The source of the data is a large number of RSS feeds, which point to an even larger number of individual web pages. The latter are what get harvested and processed with a few scripts. So normalizing the data at the source is not an option, since so few webmasters even publish mail addresses, let alone fix their sites.
Maybe there is a CPAN module or a simple method to forcibly convert the incoming data (or outgoing data) to UTF-8? Just calling it UTF-8 fails, too: binmode(STDOUT, ":encoding(utf8)"); Is there a way to find out whether it should be labeled UTF-16 instead? If so, how do I force that mode?
$ apt-cache policy perl | head -n 3
perl:
  Installed: 5.28.1-6
  Candidate: 5.28.1-6
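
For what it's worth, setting a layer on STDOUT only controls how characters are encoded on the way out; it does not decode the bytes coming in, so the harvested data probably has to be decoded first. Below is a minimal sketch of one way to guess the input encoding with the core Encode module, which ships with Perl 5.28. The helper name `decode_guess` is hypothetical, and the fallback to Windows-1252 for input that is not valid UTF-8 is an assumption, not something known about these feeds.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw bytes from a fetched page into a Perl
# character string, guessing the encoding from a byte order mark (BOM).
sub decode_guess {
    my ($bytes) = @_;

    # UTF-8 BOM (EF BB BF): strip it and decode the rest as UTF-8.
    if ($bytes =~ s/\A\xEF\xBB\xBF//) {
        return decode('UTF-8', $bytes);
    }

    # UTF-16 BOM (FE FF or FF FE): the 'UTF-16' decoder consumes the BOM
    # itself and picks the byte order from it.
    if ($bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/) {
        return decode('UTF-16', $bytes);
    }

    # No BOM: try strict UTF-8 first. FB_CROAK makes decode() die on
    # malformed input, and eval turns that into an undef result.
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Assumption: anything that is not valid UTF-8 is Windows-1252.
    return decode('cp1252', $bytes);
}
```

Once the data is a proper character string, the zero-width characters can be removed with something like `s/[\x{200B}-\x{200D}\x{FEFF}]//g`, and `binmode(STDOUT, ':encoding(UTF-8)')` then takes care of encoding the output. A BOM check is only a heuristic: UTF-16 without a BOM will not be detected this way.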

Replies are listed 'Best First'.

Re^3: Safely removing Unicode zero-width spaces and other non-printing characters
by haj (Vicar) on Dec 04, 2019 at 10:37 UTC

Re^3: Safely removing Unicode zero-width spaces and other non-printing characters
by haukex (Archbishop) on Dec 04, 2019 at 19:21 UTC
by mldvx4 (Hermit) on Dec 05, 2019 at 05:33 UTC
by haukex (Archbishop) on Dec 05, 2019 at 05:49 UTC