Hi, I am trying to find a way around unexpected output from my Perl script.

I have a set of files in a non-English language. This language frequently uses the non-breaking space (U+00A0) instead of a regular space. The use of the non-breaking space is intentional, and for this language it is important to keep it rather than substitute a normal space.

My Perl script changes the language code "en" (English) to the code appropriate for the target language. It simply uses a regular expression to search for a particular sequence of letters and replace it with something else. That's all it does. The script then saves the text to a text-based file, and it does this for thousands of files.

The problem I'm encountering is that when the original file contains a non-breaking space character (U+00A0), Perl processes the text but saves the non-breaking space as "\_".

I'm reading the original files as UTF-8, because they are saved in UTF-8 to handle non-ASCII characters. All other non-ASCII characters in the language are handled correctly, with the correct accents; only the non-breaking space is converted to something else in the output file.
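For reference, here is a minimal sketch of the kind of round trip described above: read as UTF-8, replace with a regular expression, write as UTF-8. It is not my actual script. The helper names `replace_lang_code` and `convert_file`, the pattern `\ben\b`, and the idea that the code appears as a standalone word are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the language code "en" for another code.
# Matching "en" as a standalone word is an assumption for this sketch.
sub replace_lang_code {
    my ($text, $new_code) = @_;
    $text =~ s/\ben\b/$new_code/g;
    return $text;
}

# Hypothetical helper: read one file as UTF-8, apply the replacement,
# and write the result as UTF-8. Both handles carry an explicit encoding
# layer, so the text is decoded on input and encoded again on output.
sub convert_file {
    my ($in_path, $out_path, $new_code) = @_;

    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";
    print {$out} replace_lang_code($text, $new_code);
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

With plain file handles like these, U+00A0 should come out as the bytes C2 A0 and nothing should be escaped. If the escape still shows up, it is probably introduced elsewhere in the pipeline. One possibility, offered only as a guess: "\_" is the YAML double-quoted escape for U+00A0, so a YAML emitter writing the file would be a candidate.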

For example, if I have input text:

issue: "Problém s odesláním"

The output text becomes:

issue: "Problém s\_odesláním"

The space character before the "s" character is a normal space character (U+0020), but the space character after the "s" character is a non-breaking space character (U+00A0).
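Since the two kinds of space look identical on screen, one way to confirm which is which is to print the code points of the decoded text. This is a small sketch; `code_points` is a hypothetical helper name, and the sample string is a shortened stand-in for the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code points of an already-decoded
# string as a list of "U+XXXX" labels, one per character.
sub code_points {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } split(//, $text);
}

# "m", a normal space, "s", a non-breaking space, "o"
my $sample = "m s\x{A0}o";
print join(' ', code_points($sample)), "\n";
# prints: U+006D U+0020 U+0073 U+00A0 U+006F
```

Running the same check on the text just before it is written would show whether U+00A0 is still a single character at that point.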

Does anybody know how to save the non-breaking space as a non-breaking space character, without it being converted to "\_"? I see this issue only on Mac. If I run the same Perl code on Windows, the problem does not occur. I appreciate any input. Thank you in advance.


In reply to Unexpected behavior of Perl on non-breaking space in Mac environment by hishii2001
