in reply to Re: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
in thread Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

Hmmm. Thanks. Maybe it's not the character I think it is:

> $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14&#8239;AM.png/ =~ /\s/ and say 'OK' or die"
> Died at -e line 1.

The #8239 popped in after submitting this post. It's not actually in the code.

#8239 is 202F in hex. I don't get this.

This doesn't even work:

> $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14&#8239;AM.png/ =~ /\x{202F}/ and say 'OK' or die"
> Died at -e line 1.

$PM = "Perl Monk's";
$MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
$nysus = $PM . ' ' . $MC;
Click here if you love Perl Monks


Replies are listed 'Best First'.
Re^3: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by ikegami (Patriarch) on Aug 13, 2024 at 16:02 UTC

    Even though you entered a NNBSP on the command line, your program doesn't contain a NNBSP. (And I'm not referring to the appearance of &#8239;. That's due to a PerlMonks limitation.)

    By default, Perl programs are expected to be encoded using ASCII. NNBSP isn't found in the ASCII character set, so your program can't possibly include a NNBSP.

    Assuming a UTF-8 terminal, what you actually provided Perl is equivalent to "...\xE2\x80\xAF...". But a string containing a NNBSP would be "...\x{202F}...".

    You can tell Perl that the program is encoded using UTF-8 by adding use utf8;.
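The difference between the encoded bytes and the decoded character can be checked directly. The following is a minimal sketch, assuming the input arrives as UTF-8 bytes; the variable names are illustrative and not from the thread. It decodes a short sample with the core Encode module and compares how each form behaves against `\s` and `\x{202F}`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What a UTF-8 terminal hands to Perl: three bytes, not one character.
my $bytes = "14\xE2\x80\xAFAM";

# Decoding turns those three bytes into the single character U+202F.
my $chars = decode('UTF-8', $bytes);

my $bytes_len       = length $bytes;
my $chars_len       = length $chars;
my $bytes_has_space = $bytes =~ /\s/        ? 1 : 0;
my $chars_has_space = $chars =~ /\s/        ? 1 : 0;
my $chars_has_nnbsp = $chars =~ /\x{202F}/  ? 1 : 0;

printf "bytes: length=%d  \\s matches=%d\n", $bytes_len, $bytes_has_space;
printf "chars: length=%d  \\s matches=%d  \\x{202F} matches=%d\n",
    $chars_len, $chars_has_space, $chars_has_nnbsp;
```

With `use utf8;` in effect and a source file that really is saved as UTF-8, a string literal containing the NNBSP would already be in the decoded form, so no explicit `decode` call would be needed for literals.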

      OK, thank you! I'm getting closer but still confused AF. So this returns files as desired:

      use utf8;
      my $image_name = 'Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF';
      my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

      But this still does not match:

      use utf8;
      my $image_name = 'Screenshot-2024-02-23-at-1.05.14\s';
      my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

      I'm using neovim. It shows the file is also encoded as UTF-8.


        use utf8; has no effect in that program since it's encoded using ASCII. But it doesn't hurt since ASCII is a subset of UTF-8.

        The issue is that get_all_files_in_dir is matching against the still-encoded file names.

        The second program is effectively doing

        my $fn = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF";
        $fn =~ /Screenshot-2024-02-23-at-1.05.14\s/

        That will only match if U+E2 is a space character, and it isn't.
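One way to make `\s` usable here is to decode the file names before matching. The sketch below assumes the file system stores names as UTF-8 and uses a hard-coded list as a hypothetical stand-in for what `readdir` would return; how `get_all_files_in_dir` reads names internally is not shown in the thread, so this only illustrates the principle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for readdir() output: raw, still-encoded bytes.
my @raw_names = (
    "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM-1024x698.png",
    "notes.txt",
);

my $pattern = qr/Screenshot-2024-02-23-at-1\.05\.14\s/;

# Matching the raw bytes: \s is tested against byte E2, which is not a space.
my @undecoded_hits = grep { $_ =~ $pattern } @raw_names;

# Matching decoded names: \s is tested against U+202F, which is a space.
my @decoded_hits = grep { $_ =~ $pattern }
                   map  { decode('UTF-8', $_) } @raw_names;

printf "undecoded matches: %d\n", scalar @undecoded_hits;
printf "decoded matches:   %d\n", scalar @decoded_hits;
```

The alternative is to leave the names encoded and put the bytes `\xE2\x80\xAF` in the pattern, which is what the first program in the post above does. Decoding first is usually easier to reason about, but the decoded names have to be encoded again before they are passed back to file operations.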

Re^3: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by haukex (Archbishop) on Aug 13, 2024 at 16:00 UTC
    The #8239 popped in after submitting this post. It's not actually in the code.

    Yeah, PerlMonks does that to Unicode characters in <code> blocks - see my node here.

    In that case, I would suspect an encoding error - see ikegami's reply and my node here.

      I copied and pasted the file name into a file and did a hex dump:

                00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF
      00000000  53 63 72 65 65 6E 73 68 - 6F 74 2D 32 30 32 34 2D  Screenshot-2024-
      00000010  30 32 2D 32 33 2D 61 74 - 2D 31 2E 30 35 2E 31 34  02-23-at-1.05.14
      00000020  E2 80 AF 41 4D 2D 31 30 - 32 34 78 36 39 38 2E 70  ...AM-1024x698.p
      00000030  6E 67 0A                                           ng.

      E2 80 AF is the UTF-8 encoding of U+202F. I wonder if the act of cutting and pasting is modifying the string. I'm using tmux.
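The relationship between decimal 8239, hex 202F, and the bytes E2 80 AF in the dump can be confirmed in Perl itself. This is a small sketch using the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $nnbsp = "\x{202F}";              # NARROW NO-BREAK SPACE as a character

# Code point in decimal and hex.
my $decimal = ord $nnbsp;
my $hex_cp  = sprintf '%04X', $decimal;

# UTF-8 encoding of that one character, shown as space-separated hex bytes.
my $octets    = encode('UTF-8', $nnbsp);
my $hex_bytes = join ' ', map { sprintf '%02X', ord } split //, $octets;

print "code point: $decimal (U+$hex_cp)\n";
print "UTF-8 bytes: $hex_bytes\n";
```

If the bytes in the file match what this prints, the pasted name contains a correctly encoded NNBSP, which would suggest the copy and paste step did not alter it.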
