G'day mldvx4,

From the examples you've shown, it looks like you may have a conflict between the UTF-16 used internally by MSWin and the UTF-8 used internally by Perl. That's a guess but it's the type of issue that I've seen in the past; you may have a different OS using UTF-? but that could well have similar problems.

The utf8 pragma only relates to your Perl source code. Have a look through that documentation for more details; and do note the emboldened text near the start of the DESCRIPTION.
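Here's a minimal sketch of that point, not taken from your code. It assumes the script file itself is saved as UTF-8; the variable names are purely illustrative. The same string literal is counted once without the pragma and once with it:

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the literal below is
# stored on disk as the two bytes 0xC3 0xA9.

# Without the pragma, Perl reads the literal as two separate bytes.
my $len_without = do { no utf8; length("é") };

# With the pragma, Perl reads the same literal as one character.
my $len_with = do { use utf8; length("é") };

print "without utf8: $len_without, with utf8: $len_with\n";
```

The pragma is lexically scoped and only changes how literals in the source are read; it does nothing for data arriving via filehandles, the command line, or the environment.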

Different versions of Perl have different levels of Unicode support. Check your version and see if its support (or lack thereof) might be related to your problems.

You don't indicate the source of your input data nor the target for the output. You may need to convert one or both within your script.
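As a rough illustration of converting at both ends, here's a sketch using the core Encode module. The input bytes are made up (I don't know what your real data looks like), and the substitution is a simplified one that just swaps each NO-BREAK SPACE for an ordinary space:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes for "A", NO-BREAK SPACE, "B".
my $bytes = "A\xC2\xA0B";

# Decode the bytes into Perl characters before doing any matching.
my $text = decode('UTF-8', $bytes);

# Simplified substitution: replace each NO-BREAK SPACE with a space.
(my $clean = $text) =~ s/\x{00A0}/ /g;

# Encode the characters back into UTF-8 bytes for output.
my $out = encode('UTF-8', $clean);
print "$out\n";
```

If your data comes from files, putting an `:encoding(UTF-8)` layer on the filehandles (via open or binmode) does the same decoding and encoding for you. If the input really is UTF-16, the name passed to decode would need to change accordingly.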

Take a look at the perl manpage. Under Reference Manual, you'll see a lot of links like "perluni*" — pick ones that are appropriate for your level of Unicode knowledge and read on from there.

With more information about your OS, Perl version and I/O handling, along with some sample code and input/output data, you may get a better answer.

Addendum: Regarding the substitution at the start of your post, I ran this quick test:

$ perl -E 'my $x = "A\N{NO-BREAK SPACE}B\N{NO-BREAK SPACE}C"; $x =~ s/ +\x{00A0}/ /g; say $x'
A B C

Note that the source code only contains 7-bit ASCII characters. There is no need for the utf8 pragma here.

— Ken


In reply to Re: Safely removing Unicode zero-width spaces and other non-printing characters by kcott
in thread Safely removing Unicode zero-width spaces and other non-printing characters by mldvx4
