nysus has asked for the wisdom of the Perl Monks concerning the following question:

Got a rude reminder of my cluelessness this morning, losing over an hour trying to figure out why my regex wasn't working on file names with spaces in them. Turns out they weren't spaces but UTF-8 characters, NARROW NO-BREAK SPACE.

I'm running a newer version of Perl, 5.36. I see old posts about this issue but nothing recent. Wondering if there is a clean way of handling these things other than doing a tr// on every string. Ideally, I'd like to get \s to match it.

$PM = "Perl Monk's";
$MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
$nysus = $PM . ' ' . $MC;


Replies are listed 'Best First'.
Re: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by haukex (Archbishop) on Aug 13, 2024 at 15:35 UTC

    SSCCE please... \s works for me. Are you sure you're not using the /a modifier or something like that?

    perl -wMstrict -lE "qq/\N{NARROW NO-BREAK SPACE}/ =~ /\A\s\z/ and say +'OK' or die"
    OK

      Hmmm. Thanks. Maybe it's not the character I think it is:

      > $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14 AM.png/ =~ /\s/ and say 'OK' or die"
      Died at -e line 1.

      The #8239 popped in after submitting this post. It's not actually in the code.

      #8239 is 202F in hex. I don't get this.

      This doesn't even work:

      > $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14 AM.png/ =~ /\x{202F}/ and say 'OK' or die"
      Died at -e line 1.


        Even though you entered a NNBSP on the command line, your program doesn't contain a NNBSP. (And I'm not referring to the appearance of &#8239;. That's due to a PerlMonks limitation.)

        By default, Perl programs are expected to be encoded using ASCII. NNBSP isn't found in the ASCII character set, so your program can't possibly include a NNBSP.

        Assuming a UTF-8 terminal, what you actually provided Perl is equivalent to "...\xE2\x80\xAF...". But a string containing a NNBSP would be "...\x{202F}...".

        You can tell Perl that the program is encoded using UTF-8 by adding use utf8;.
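        To make the bytes-versus-character distinction above concrete, here is a minimal sketch. It spells the three UTF-8 bytes out with escapes so that it does not depend on how the source file or terminal is encoded; the variable names are only for illustration, and Encode's decode stands in for whatever decoding step your program uses.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode);

        # What Perl sees when a NNBSP is typed into source without "use utf8":
        # the three bytes of its UTF-8 encoding, not one character.
        my $bytes = "14\xE2\x80\xAFAM";

        # Decoding turns those three bytes into the single character U+202F.
        my $chars = decode('UTF-8', $bytes);

        printf "bytes: %vX\n", $bytes;   # 31.34.E2.80.AF.41.4D
        printf "chars: %vX\n", $chars;   # 31.34.202F.41.4D

        # \s only matches once the string holds the decoded character.
        print "bytes match \\s: ", ($bytes =~ /\s/ ? "yes" : "no"), "\n";
        print "chars match \\s: ", ($chars =~ /\s/ ? "yes" : "no"), "\n";
        ```

        The first check reports "no" and the second "yes", which is the same failure and success seen elsewhere in this thread.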

        "The #8239 popped in after submitting this post. It's not actually in the code."

        Yeah, PerlMonks does that to Unicode characters in <code> blocks - see my node here.

        In that case, I would suspect an encoding error - see ikegami's reply and my node here.

Re: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by ikegami (Patriarch) on Aug 13, 2024 at 15:41 UTC

    \s matches whitespace characters, which includes U+202F NARROW NO-BREAK SPACE.

    $ perl -le'print "\x{202F}" =~ /^\s\z/ ? "match" : "no match"'
    match

    Do you have a NNBSP, or do you have its UTF-8 encoding? Don't forget to decode your inputs (and encode your outputs)!

    If you need further help, please provide the output of sprintf( "%vX", $_ ) for a string that supposedly includes a NNBSP.
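    A rough sketch of the decode-inputs, encode-outputs advice applied to a file name. It assumes the name arrives as raw UTF-8 bytes (as readdir typically returns on a UTF-8 file system); normalize_spaces is a hypothetical helper invented for this example, not something from the original post.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Hypothetical helper: takes a file name as raw UTF-8 bytes and returns
    # raw bytes with every whitespace character replaced by an ASCII space.
    sub normalize_spaces {
        my ($raw_name) = @_;
        my $name = decode('UTF-8', $raw_name);   # bytes -> characters
        $name =~ s/\s/ /g;                       # \s now sees U+202F as whitespace
        return encode('UTF-8', $name);           # characters -> bytes
    }

    # The NNBSP is written as its UTF-8 bytes so the example is encoding-proof.
    my $raw = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

    printf "before: %vX\n", $raw;
    printf "after:  %vX\n", normalize_spaces($raw);
    ```

    The %vX dumps make it easy to see whether the E2.80.AF sequence was present in the input and that a plain 20 replaces it in the output.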

      I'm not sure what I have. The file definitely has an invisible character in it. When I copy and paste the file name:

      Works:

      perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14 AM.png/ =~ /A/ and say 'OK' or die"

      Doesn't work:

      perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14 AM.png/ =~ /\s/ and say 'OK' or die"

      The character between the "4" and "A" is the invisible character.


        I would strongly recommend not using the command line, as it can have its own encoding issues. Instead, put everything in a script; if it contains non-ASCII characters, make sure to save it as UTF-8 and add use utf8;.
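        Something along these lines, as a sketch only: it assumes the file is saved as UTF-8 and that the character between "14" and "AM" is still a literal NNBSP after copy and paste. The %vX line is there so you can check what the string really contains.

        ```perl
        use strict;
        use warnings;
        use utf8;    # tells Perl this source file is encoded as UTF-8

        # Assumption: the character between "14" and "AM" is a literal U+202F.
        my $name = "Screenshot-2024-02-23-at-1.05.14 AM.png";

        # Dump the code points; a NNBSP shows up as 202F, a plain space as 20.
        printf "%vX\n", $name;

        print(($name =~ /\s/) ? "OK\n" : "no match\n");
        ```

        If the dump shows E2.80.AF instead of 202F, the use utf8; line is missing or the file was not saved as UTF-8.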