Even though you entered a NNBSP on the command line, your program doesn't contain a NNBSP. (And I'm not referring to how the character is displayed in your post. That's due to a PerlMonks limitation.)

By default, Perl programs are expected to be encoded using ASCII. NNBSP isn't found in the ASCII character set, so your program can't possibly include a NNBSP.

Assuming a UTF-8 terminal, what you actually provided Perl is equivalent to "...\xE2\x80\xAF...". But a string containing a NNBSP would be "...\x{202F}...".

You can tell Perl that the program is encoded using UTF-8 by adding use utf8;.
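
To make the distinction concrete, here is a minimal sketch comparing the two strings. The variable names are illustrative and not from the original post, and it assumes the bytes really are the UTF-8 encoding of NNBSP:

```perl
use strict;
use warnings;
use Encode qw( decode );

# What Perl sees without "use utf8;": the three UTF-8 bytes of NNBSP.
my $bytes = "\xE2\x80\xAF";

# A string that really contains a NNBSP: a single character.
my $char = "\x{202F}";

# Decoding the bytes as UTF-8 yields that single character.
my $decoded = decode('UTF-8', $bytes);

printf "bytes:   length %d, matches \\s: %s\n",
    length($bytes), ($bytes =~ /\s/ ? "yes" : "no");
printf "char:    length %d, matches \\s: %s\n",
    length($char), ($char =~ /\s/ ? "yes" : "no");
printf "decoded equals char: %s\n", ($decoded eq $char ? "yes" : "no");
```

Only the one-character string matches \s; the three-byte string does not, because none of its bytes is a whitespace character.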

Re^4: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by nysus (Parson) on Aug 13, 2024 at 16:16 UTC

    OK, thank you! I'm getting closer but still confused AF. So this returns files as desired:

    use utf8;
    my $image_name = 'Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF';
    my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

    But this still does not match:

    use utf8;
    my $image_name = 'Screenshot-2024-02-23-at-1.05.14\s';
    my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

    I'm using neovim. It shows the file is also encoded as UTF-8.

    $PM = "Perl Monk's";
    $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
    $nysus = $PM . ' ' . $MC;

      use utf8; has no effect in that program since it's encoded using ASCII. But it doesn't hurt since ASCII is a subset of UTF-8.

      The issue is that get_all_files_in_dir is matching against the still-encoded file names.

      The second program is effectively doing

      my $fn = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF";
      $fn =~ /Screenshot-2024-02-23-at-1.05.14\s/

      That will only match if U+00E2 (which is how Perl sees the first byte, \xE2, of the undecoded name) is a space character, and it isn't.
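
      As a rough sketch of the alternative, the file names can be decoded before matching. The helper grep_decoded_names is hypothetical (it is not part of get_all_files_in_dir), and the sample name assumes the NNBSP sits between "14" and "AM" and that file names are UTF-8 encoded. The dots in the pattern are escaped here, unlike in the original:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode each raw (UTF-8 encoded) file name into a
# character string, and return the raw names whose decoded form matches $re.
sub grep_decoded_names {
    my ($re, @raw_names) = @_;
    return grep { my $bytes = $_; decode('UTF-8', $bytes) =~ $re } @raw_names;
}

# File names as the file system would return them: NNBSP as three bytes.
my @raw = (
    "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png",
    "notes.txt",
);

my $re = qr/Screenshot-2024-02-23-at-1\.05\.14\s/;

# Matching the still-encoded names: \s is tested against byte \xE2.
my @without_decode = grep { $_ =~ $re } @raw;

# Matching the decoded names: \s is tested against U+202F.
my @with_decode = grep_decoded_names($re, @raw);

printf "without decoding: %d match(es)\n", scalar @without_decode;
printf "with decoding:    %d match(es)\n", scalar @with_decode;
```

      The helper returns the original byte strings, so they can still be passed to open or other file operations unchanged.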

        OK, before you give up on me: I put the file named Screenshot-2024-02-23-at-1.05.14 AM.png (with the hidden space character) in a directory, along with this script:

        #! /usr/bin/env perl

        use v5.36;
        use utf8;

        # get all the files in the current directory
        my @files = glob("*");

        my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
        say $file;

        The above reports: Use of uninitialized value $file in say at ./test.pl line 10.

        If I change the regex to /Screenshot-2024-02-23-at-1.05.14/ it works fine.

        I'm beginning to think Perl does not handle these chars in file names properly. But I'm clueless so that's a wild guess.

        EDIT: I should definitely mention I'm on macOS, which I heard doesn't have the best support for UTF-8.

        EDIT2: I tried this script in a Linux Docker container. Same result as on macOS.
