in reply to Safely removing Unicode zero-width spaces and other non-printing characters

Desktop Thatâ<U+0080><U+0099>s More Elegant

Which should look like this instead:

Desktop That’s More Elegant

You know, in HTML, it is possible to insert codes that produce UTF characters on the screen. These character entity references exist in case you want the source code to be plain ASCII only, with no raw non-ASCII characters. I prefer that because, as you said, the Unicode characters can mess up the code. For example, the above text should be:

Desktop That&rsquo;s More Elegant

How to encode UTF characters in HTML

If I had the same problem, I would write a perl sub that replaces all these specific characters with the HTML equivalent first, and then just remove all 00 characters from the entire text and deal with the spaces and line breaks last.
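A minimal sketch of that idea, using core Perl only. The sub name `to_ascii_html` and the set of characters it strips are assumptions made for illustration, and "00 characters" is read here as NUL characters. The stripping runs first, because otherwise the zero-width characters would themselves be turned into entities. Whitespace and line-break cleanup is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: the name and the stripped character set are assumptions.
sub to_ascii_html {
    my ($text) = @_;

    # Drop NUL plus common zero-width characters (ZWSP, ZWNJ, ZWJ, BOM).
    $text =~ s/[\x{0000}\x{200B}-\x{200D}\x{FEFF}]//g;

    # Turn every remaining non-ASCII character into a numeric character reference.
    $text =~ s/([^\x{00}-\x{7F}])/sprintf('&#x%X;', ord $1)/ge;

    return $text;
}

my $in = "Desktop That\x{2019}s More\x{200B} Elegant";
print to_ascii_html($in), "\n";    # Desktop That&#x2019;s More Elegant
```

This assumes the input has already been decoded into a Perl character string. Run on raw UTF-8 bytes, it would encode each byte separately.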

Re^2: Safely removing Unicode zero-width spaces and other non-printing characters
by haukex (Archbishop) on Dec 04, 2019 at 19:11 UTC
    in HTML, it is possible to insert codes that produce UTF characters on the screen

    That's a possibility. However, there are also escape codes to allow representing arbitrary Unicode characters, such as "\N{U+NNNN}", which are implemented natively in Perl.
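As a small illustration of that escape (the variable names are made up for the example), `\N{U+NNNN}` lets the source file stay pure ASCII while the string still holds the real characters. It also works inside a regex, which is handy for the zero-width space from the thread title:

```perl
use strict;
use warnings;

# U+2019 is RIGHT SINGLE QUOTATION MARK; the source stays ASCII-only.
my $text = "Desktop That\N{U+2019}s More\N{U+200B} Elegant";

# The same escape can be used in a pattern to strip ZERO WIDTH SPACE.
(my $clean = $text) =~ s/\N{U+200B}//g;

printf "U+%04X\n", ord "\N{U+2019}";                    # U+2019
print length($text) - length($clean), " removed\n";     # 1 removed
```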

    I would write a perl sub that replaces all these specific characters with the HTML equivalent first

    No need to write a function yourself: HTML::Entities.
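A short sketch of what that could look like, assuming HTML::Entities is installed from CPAN (it is not a core module) and that the input is already a decoded character string:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

my $text = "Desktop That\x{2019}s More Elegant";

# With no second argument, encode_entities escapes control characters,
# non-ASCII characters, and the HTML-special characters < & > " '.
my $html = encode_entities($text);
print "$html\n";    # Desktop That&rsquo;s More Elegant

# decode_entities reverses the mapping.
print decode_entities($html) eq $text ? "round trip ok\n" : "mismatch\n";
```

Characters without a named entity come out as numeric references instead.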