in reply to Re^3: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
in thread Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

OK, thank you! I'm getting closer but still confused AF. So this returns files as desired:

use utf8;
my $image_name = 'Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF';
my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

But this still does not match:

use utf8;
my $image_name = 'Screenshot-2024-02-23-at-1.05.14\s';
my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

I'm using neovim. It shows that the file is also encoded as UTF-8.

$PM = "Perl Monk's";
$MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
$nysus = $PM . ' ' . $MC;


Re^5: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by ikegami (Patriarch) on Aug 13, 2024 at 16:28 UTC

    use utf8; has no effect in that program since it's encoded using ASCII. But it doesn't hurt since ASCII is a subset of UTF-8.

    The issue is that get_all_files_in_dir is matching against the still-encoded file names.

    The second program is effectively doing

    my $fn = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF";
    $fn =~ /Screenshot-2024-02-23-at-1.05.14\s/

    That will only match if U+00E2 is a space character, and it isn't. In the still-encoded string, \s is tested against the byte E2, not against U+202F.
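To make the byte-versus-character point concrete, here is a minimal sketch. It assumes the file name arrives as raw UTF-8 bytes, as described above; the helper name matches is made up for this illustration and is not part of any module in the thread.

```perl
use strict;
use warnings;

# The file name as raw UTF-8 bytes (assumed here to be what the OS hands back).
my $bytes = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

# Decode a copy in place: the three bytes E2 80 AF become the single
# character U+202F (NARROW NO-BREAK SPACE).
my $chars = $bytes;
utf8::decode($chars) or die "File name is not valid UTF-8";

my $re = qr/Screenshot-2024-02-23-at-1\.05\.14\s/;

# Hypothetical helper: returns 1 if the name matches the pattern, else 0.
sub matches {
    my ($name) = @_;
    return $name =~ $re ? 1 : 0;
}

printf "bytes: length %d, \\s matches: %s\n",
    length($bytes), matches($bytes) ? 'yes' : 'no';
printf "chars: length %d, \\s matches: %s\n",
    length($chars), matches($chars) ? 'yes' : 'no';
```

The decoded string is two units shorter because three bytes collapse into one character, and only the decoded string has a whitespace character for \s to match.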

      Ok, before you give up on me: I put the file named Screenshot-2024-02-23-at-1.05.14 AM.png (with the hidden space character) in a directory, along with this script in the same dir:

      #! /usr/bin/env perl

      use v5.36;
      use utf8;

      # get all the files in the current directory
      my @files = glob("*");

      my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
      say $file;

      The above reports: Use of uninitialized value $file in say at ./test.pl line 10.

      If I change the regex to /Screenshot-2024-02-23-at-1.05.14/ it works fine.

      I'm beginning to think Perl does not handle these chars in file names properly. But I'm clueless so that's a wild guess.

      EDIT: I should definitely mention I'm on macOS, which I heard doesn't have the best support for UTF-8.

      EDIT2: I tried this script in a Linux Docker container. Same result as on macOS.


        This is no different than before with get_all_files_in_dir. The program is effectively doing

        local $_ = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
        /Screenshot-2024-02-23-at-1.05.14\s/

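        That will only match if U+00E2 is a space character, and it isn't. glob returns the file name as bytes, so \s is tested against the byte E2 rather than against U+202F.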

        These would match:

        local $_ = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
        /Screenshot-2024-02-23-at-1.05.14\s/

        local $_ = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
        utf8::decode( $_ );
        /Screenshot-2024-02-23-at-1.05.14\s/
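Applied to a directory listing, the same decoding step might look like the sketch below. It is an illustration under assumptions, not the code from the thread: it creates a throwaway directory with File::Temp, reads it with readdir (which, like glob, is assumed to return undecoded bytes), and uses a made-up helper named decode_names.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: decode copies of byte-string file names in place.
sub decode_names {
    my @names = @_;
    utf8::decode($_) for @names;
    return @names;
}

# Set up a scratch directory holding one file whose name contains the
# UTF-8 encoding of U+202F (bytes E2 80 AF).
my $dir  = tempdir(CLEANUP => 1);
my $name = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
open my $fh, '>', "$dir/$name" or die "Cannot create test file: $!";
close $fh;

# Read the directory; the names come back as bytes.
opendir my $dh, $dir or die "Cannot open $dir: $!";
my @raw = grep { !/^\.\.?\z/ } readdir $dh;
closedir $dh;

my $re = qr/Screenshot-2024-02-23-at-1\.05\.14\s/;

my ($undecoded) = grep { $_ =~ $re } @raw;
my ($decoded)   = grep { $_ =~ $re } decode_names(@raw);

print defined $undecoded ? "raw names: match\n"     : "raw names: no match\n";
print defined $decoded   ? "decoded names: match\n" : "decoded names: no match\n";
```

Note that a decoded name has to be encoded back to bytes (for example with utf8::encode on a copy) before it is passed to open or other file operations.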

        Ok, I gotta think we are looking at a bug here:

        #! /usr/bin/env perl

        use v5.36;
        use utf8;

        # get all the files in the current directory
        my @files = glob("*");

        my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
        say $file; # ERROR!

        my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
        say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

        The screenshot file definitely exists in the directory.
