hishii2001 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to find a way around unexpected output from my Perl script.

I have a set of files in a non-English language. This language quite often uses the non-breaking space (U+00A0) instead of a regular space. The use of the non-breaking space is intentional, and for this language it is very important to use it rather than a normal space character.

What my Perl script does is change the language code "en" for English to something else that is appropriate to the language. So, I'm simply using a regular expression to search for a particular sequence of letters and replace it with something else. That's all it does. Then the script saves the text in a text-based file. The script does this for thousands of files.

The problem I'm encountering is that when the original file has a non-breaking space character (U+00A0), Perl processes the text, but saves the non-breaking space character as "\_".

I'm reading the original files as UTF-8 because they are saved in UTF-8 to handle non-ASCII characters. All other non-ASCII characters used in the foreign language are handled correctly, with the correct accents; only the non-breaking space character is converted to something else in the output file.

For example, if I have input text:

issue: "Problém s odesláním"

The output text becomes as below:

issue: "Problém s\_odesláním"

The space character before the "s" character is a normal space character (U+0020), but the space character after the "s" character is a non-breaking space character (U+00A0).

Does anybody have any idea how to save the non-breaking space as a non-breaking space character without converting it to "\_"? I'm experiencing this issue only in a Macintosh environment. If I use the same Perl code in a Windows environment, I do not have this issue. I appreciate any input. Thank you in advance.

Re: Unexpected behavior of Perl on non-breaking space in Mac environment
by choroba (Cardinal) on Apr 20, 2022 at 13:53 UTC
    Can you reduce the code to a few lines and post them? I fear it's impossible for us to reproduce the described behaviour.
    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use utf8;

    my $string = qq(issue: "Problém s\N{NO-BREAK SPACE}odesláním");
    binmode *STDOUT, ':encoding(UTF-8)';
    say $string;
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      I apologize. The input file was the wrong version (my code was grabbing a file from the wrong directory), which made me think that Perl was converting the non-breaking space to "\_". I sincerely apologize for the trouble.
        Suggestion: edit the original post title to have "SOLVED" on the end? (Glad it is, by the way!)
Re: Unexpected behavior of Perl on non-breaking space in Mac environment
by haukex (Archbishop) on Apr 20, 2022 at 13:53 UTC
    Perl processes the text, but saves the non-breaking space character as "\_"

    Sorry, but I think this is likely not the explanation. Instead:

    I'm simply using a regular expression to search for a particular sequence of letters and replace it with something else.

    Probably this regular expression is making this change, or it could maybe be an encoding issue, but you haven't shown us any code at all, so we currently can't help you.

    However, if you can post a Short, Self-Contained, Correct Example that reproduces the issue, then I'm sure we can help. In this node I describe what information you should include in your post when debugging Unicode issues.
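
    A minimal sketch of what such a self-contained example might look like, under stated assumptions: the original script was never posted, so the `lang: "en"` line, the target code "cs", and the substitution itself are all hypothetical. The sketch writes a sample to a temporary file as UTF-8, reads it back with an explicit decoding layer, applies the substitution, and prints every non-ASCII character as `U+XXXX` so that a non-breaking space cannot be mistaken for a normal one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data. Non-ASCII characters are written as escapes
# so the result does not depend on how this source file is saved.
# \x{A0} is the non-breaking space after the "s".
my $input = qq(lang: "en"\nissue: "Probl\x{E9}m s\x{A0}odesl\x{E1}n\x{ED}m"\n);

# Write the sample to a temporary file, encoded as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} $input;
close $out or die "close $path: $!";

# Read the file back with an explicit UTF-8 decoding layer.
open my $in, '<:encoding(UTF-8)', $path or die "open $path: $!";
my $text = do { local $/; <$in> };
close $in;

# Hypothetical substitution: change the language code "en" to "cs".
(my $changed = $text) =~ s/^(lang: ")en(")/${1}cs$2/m;

# Replace each non-ASCII character with its code point for inspection.
(my $visible = $changed) =~ s/([^\x00-\x7F])/sprintf 'U+%04X', ord $1/ge;

binmode STDOUT, ':encoding(UTF-8)';
print $visible;
```

    If the character after the "s" shows up as U+00A0 in this dump, the read, substitute, and write steps have preserved the non-breaking space, and the input file itself is the next thing to check.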

      Thank you for your input, and my apologies. I realized that my script was grabbing a wrong version of the file, one that for some reason already had the non-breaking space written as "\_". After grabbing the correct version of the file set, it behaves as expected. Thank you as well for pointing me to https://www.perlmonks.org/?node_id=1233709
Re: Unexpected behavior of Perl on non-breaking space in Mac environment
by hishii2001 (Novice) on Apr 20, 2022 at 14:18 UTC
    Please discard this post. I found that the issue was caused by processing an incorrect version of the file set. My sincere apologies.

      Don't feel bad because stuff like this happens to all of us. One of the best debugging steps you can take when you hit an apparent dead end is to walk away for a while and work on something completely different before coming back for a second look. I'll bet that may be what happened here: you posted the question then came back after a bit and figured out you weren't looking where you thought you were and then everything shook out.

      Coming at things with a fresh(er) perspective can help illuminate the blindingly obvious (because when you're heads down it's easy for the obvious to not be so obvious because your focus is too sharp elsewhere).

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Unexpected behavior of Perl on non-breaking space in Mac environment
by AnomalousMonk (Archbishop) on Apr 21, 2022 at 00:55 UTC