in reply to Re^4: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
in thread Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

use utf8; has no effect in that program since it's encoded using ASCII. But it doesn't hurt since ASCII is a subset of UTF-8.

The issue is that get_all_files_in_dir is matching against the still-encoded file names.

The second program is effectively doing

my $fn = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF";
$fn =~ /Screenshot-2024-02-23-at-1.05.14\s/

That will only match if U+00E2 (which is how the still-encoded byte E2 is treated) is a space character, and it isn't.
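
To make that concrete, here is a minimal sketch (not taken from either program; the variable names are made up for illustration) that runs the same pattern against the raw bytes and against a decoded copy. It assumes the file name is stored as UTF-8, which the E2 80 AF bytes suggest:

```perl
use strict;
use warnings;

# The file name as the OS returns it: NARROW NO-BREAK SPACE (U+202F)
# arrives as the three UTF-8 bytes E2 80 AF.
my $bytes = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

# Decode a copy in place so those three bytes become one character.
my $chars = $bytes;
utf8::decode($chars) or die "file name is not valid UTF-8";

my $re = qr/Screenshot-2024-02-23-at-1\.05\.14\s/;

# Report the length and the match result for each form of the name.
printf "bytes: length %d, %s\n", length($bytes), $bytes =~ $re ? "match" : "no match";
printf "chars: length %d, %s\n", length($chars), $chars =~ $re ? "match" : "no match";
```

The byte string is two elements longer than the decoded one, and only the decoded one has a whitespace character after "14".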

Re^6: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by nysus (Parson) on Aug 13, 2024 at 16:39 UTC

    Ok, before you give up on me: I put the file named Screenshot-2024-02-23-at-1.05.14 AM.png (with the hidden space character) in a directory, along with this script in the same dir:

    #! /usr/bin/env perl

    use v5.36;
    use utf8;

    # get all the files in the current directory
    my @files = glob("*");

    my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
    say $file;

    The above reports: Use of uninitialized value $file in say at ./test.pl line 10.

    If I change the regex to /Screenshot-2024-02-23-at-1.05.14/ it works fine.

    I'm beginning to think Perl does not handle these chars in file names properly. But I'm clueless so that's a wild guess.

    EDIT: I should definitely mention I'm on macOS, which I heard doesn't have the best support for UTF-8.

    EDIT2: I tried this script in a Linux Docker container. Same result as on macOS.

    $PM = "Perl Monk's";
    $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
    $nysus = $PM . ' ' . $MC;

      This is no different than before with get_all_files_in_dir. The program is effectively doing

      local $_ = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
      /Screenshot-2024-02-23-at-1.05.14\s/

      That will only match if U+00E2 (which is how the still-encoded byte E2 is treated) is a space character, and it isn't.

      These would match:

      local $_ = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
      /Screenshot-2024-02-23-at-1.05.14\s/

      local $_ = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
      utf8::decode( $_ );
      /Screenshot-2024-02-23-at-1.05.14\s/

      Ok, I gotta think we are looking at a bug here:

      #! /usr/bin/env perl

      use v5.36;
      use utf8;

      # get all the files in the current directory
      my @files = glob("*");

      my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
      say $file; # ERROR!

      my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
      say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

      The screenshot file definitely exists in the directory.


        my @files = glob("*");

        Here, the filenames you read in are encoded as bytes. See them as bytes:

        use Data::Dumper;
        $Data::Dumper::Useqq = 1;

        my @files = glob("*");
        for my $file (@files) {
            say Dumper $file;
        };

        Now, if you want to use Unicode matching semantics, you want to decode your filenames from the filesystem representation into Unicode:

        use Encode 'decode';
        use Data::Dumper;
        $Data::Dumper::Useqq = 1;

        my @files = map { decode 'UTF-8', $_ } glob("*");
        for my $file (@files) {
            say Dumper $file;
        };

        The filesystem operations take the raw, still-encoded byte strings, but your regular expression needs a decoded Unicode string. Use the correct one in each situation.
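
        One way to keep both forms at hand is to store the raw name next to its decoded copy. The following is only a sketch under stated assumptions: a Unix-like filesystem that returns UTF-8 encoded names, a scratch directory created with File::Temp so the example is self-contained, and readdir swapped in for glob so the temporary path is not subject to glob's pattern parsing. The hash keys raw and text are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempdir);

# Scratch directory holding one file whose name contains U+202F,
# written here as its UTF-8 bytes E2 80 AF.
my $dir      = tempdir(CLEANUP => 1);
my $raw_name = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
open my $fh, '>', "$dir/$raw_name" or die "create failed: $!";
close $fh;

# Names read from the directory are byte strings.
opendir my $dh, $dir or die "opendir failed: $!";
my @raw = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

# Keep both forms: bytes for the OS, decoded text for matching.
my @entries = map { +{ raw => $_, text => decode('UTF-8', $_) } } @raw;

# Match against the decoded text, where \s can see U+202F.
my @hits = grep { $_->{text} =~ /Screenshot-2024-02-23-at-1\.05\.14\s/ } @entries;

for my $hit (@hits) {
    # File tests and open() get the original bytes, not the decoded text.
    my $path = "$dir/$hit->{raw}";
    print "found: $path\n" if -e $path;
}
```

        The point of carrying the raw field along is that the decoded string is only used for matching; anything handed back to the filesystem stays exactly as the filesystem supplied it.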

        Definitely a bug (I think):

        #! /usr/bin/env perl

        use v5.36;
        use utf8;

        # get all the files in the current directory
        my @files = glob("*");

        my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
        my $ss = $files[0];
        my $hex = unpack("H*", $ss);
        say $hex;

        say $file; # ERROR!

        my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
        my $hex2 = unpack("H*", $blah);
        say $hex2;
        say $hex eq $hex2 ? "hexes equal" : "hexes not equal";
        say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

        OUTPUT:

        53637265656e73686f742d323032342d30322d32332d61742d312e30352e3134e280af414d2e706e67
        Use of uninitialized value $file in say at ./test.pl line 14.
        Character in 'H' format wrapped in unpack at ./test.pl line 17.
        53637265656e73686f742d323032342d30322d32332d61742d312e30352e31342f414d2e706e67
        hexes not equal
        1
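
        One way to read that output, in line with the earlier replies: the name from glob ends in the bytes e2 80 af (U+202F encoded as UTF-8), while the literal under use utf8 holds the single character U+202F. unpack "H*" cannot fit that character into one byte, hence the "wrapped" warning and the lone 2f. A minimal sketch of a like-for-like comparison, assuming UTF-8 file names (the variable names are illustrative), encodes the literal before dumping it:

```perl
use strict;
use warnings;

# Bytes, as glob would return them for this file name.
my $from_disk = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

# Characters, equivalent to the literal written under "use utf8".
my $literal = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";

# Encode a copy of the character string so both sides are bytes.
my $literal_bytes = $literal;
utf8::encode($literal_bytes);

my $hex_disk    = unpack("H*", $from_disk);
my $hex_literal = unpack("H*", $literal_bytes);

print "$hex_disk\n";
print "$hex_literal\n";
print $hex_disk eq $hex_literal ? "hexes equal\n" : "hexes not equal\n";
```

        Compared this way, neither unpack call has to wrap a wide character, and the two dumps can be checked byte for byte.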
