
my @files = glob("*");

Here, the filenames you read in are byte strings in the filesystem's encoding. Dump them to see the raw bytes:

use feature 'say';
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my @files = glob("*");
for my $file (@files) {
    say Dumper $file;
}

Now, if you want to use Unicode matching semantics, you want to decode your filenames from the filesystem representation into Unicode:

use feature 'say';
use Encode 'decode';
use Data::Dumper;
$Data::Dumper::Useqq = 1;

# Decode each filename from its on-disk bytes into a string of characters.
my @files = map { decode 'UTF-8', $_ } glob("*");
for my $file (@files) {
    say Dumper $file;
}

The filesystem operations take raw strings, but your regular expression takes a Unicode string. Use the correct one in each situation.
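As a rough illustration of that rule, here is a minimal sketch that keeps both forms of one name side by side. It assumes the filesystem stores names as UTF-8, and the variable names ($raw_name, $text_name) are made up for the example. The byte string stands in for what glob() would return, so the sketch runs without touching any real files.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( decode encode );

# Assumption: the filesystem stores names as UTF-8.
# This byte string stands in for what glob() would return for such a file.
my $raw_name = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

# Decode once for text work; keep the raw bytes for filesystem calls.
my $text_name = decode( 'UTF-8', $raw_name );

# Unicode-aware matching wants the decoded string: \s matches U+202F.
my $matches_text = $text_name =~ /1\.05\.14\sAM/ ? 1 : 0;

# The same pattern does not match the undecoded bytes E2 80 AF.
my $matches_raw = $raw_name =~ /1\.05\.14\sAM/ ? 1 : 0;

# Encoding the decoded name gives back the bytes to hand to open(), rename(), etc.
my $round_trip = encode( 'UTF-8', $text_name );

say "decoded matches: $matches_text";
say "raw matches:     $matches_raw";
say "round trip ok:   ", ( $round_trip eq $raw_name ? 1 : 0 );
```

In a real script you would pass the original bytes from glob() (or the re-encoded name) to the filesystem call and use the decoded copy only for matching and display.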

Re^9: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by nysus (Parson) on Aug 13, 2024 at 18:09 UTC

    Corion, it looks like you are right after all. I also need to change how the hex codes are printed and use sprintf "%vX".

    Wow, this is all so confusing. But I'm learning (hopefully). Thank you!

    $PM = "Perl Monk's";
    $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
    $nysus = $PM . ' ' . $MC;
    Click here if you love Perl Monks

Re^9: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by nysus (Parson) on Aug 13, 2024 at 17:33 UTC

    Thanks, but I'm not sure I follow. I tried applying decode to file names:

    #! /usr/bin/env perl
    use v5.36;
    use Encode 'decode';

    # get all the files in the current directory
    my @files = map { decode 'UTF-8', $_ } glob("*");
    my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

    my $ss = $files[0];
    my $hex = unpack("H*", $ss);
    say $hex;
    say $file; # ERROR!

    my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
    my $hex2 = unpack("H*", $blah);
    say $hex2;
    say $hex eq $hex2 ? "hexes equal" : "hexes not equal";
    say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

    OUTPUTS:

    Character in 'H' format wrapped in unpack at ./test.pl line 12.
    53637265656e73686f742d323032342d30322d32332d61742d312e30352e31342f414d2e706e67
    Wide character in say at ./test.pl line 15.
    Screenshot-2024-02-23-at-1.05.14 AM.png
    53637265656e73686f742d323032342d30322d32332d61742d312e30352e3134e280af414d2e706e67
    hexes not equal

      And we're back to the original question (answered here): You're missing use utf8;.

      Also, you should be using sprintf "%vX", $_ instead of unpack "H*", $_. The former handles any string. The latter only handles strings of bytes (strings where no character is higher than 0xFF), so it's definitely inappropriate here.

      #!/usr/bin/perl
      
      use v5.36;
      use warnings;
      
      # Source code encoded using UTF-8.
      use utf8;
      
      # Terminal provides/expects UTF-8 (for `say`).
      use open ":std", ":encoding(UTF-8)"; 
      
      use Encode qw( decode_utf8 );
      
      my $base = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
      my $lit  = "Screenshot-2024-02-23-at-1.05.14 AM.png";
      
      my @files = map { decode_utf8 $_ } glob( "*" );
      my ( $file ) = grep { /^Screenshot-2024-02-23-at-1.05.14\s/ } @files;
      
      my $base_hex = sprintf "%vX", $base;
      
      for ( $base, $lit, $file ) {
         say $_;
         say $_ eq $base ? "same" : "different";
      
         my $hex = sprintf "%vX", $_;
         say $hex;
         say $hex eq $base_hex ? "same" : "different";
      }
      

      Output:

      Screenshot-2024-02-23-at-1.05.14 AM.png
      same
      53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
      same
      Screenshot-2024-02-23-at-1.05.14 AM.png
      same
      53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
      same
      Screenshot-2024-02-23-at-1.05.14 AM.png
      same
      53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
      same
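To make the difference between the two hex dumps concrete, here is a small sketch run on a short stand-in string rather than the full filename. The variable names are illustrative only. The unpack warning is silenced on purpose so that only the two results are printed.

```perl
use strict;
use warnings;
use feature 'say';

# A decoded (character) string containing U+202F NARROW NO-BREAK SPACE.
my $chars = "14\x{202F}AM";

# %vX prints each character's code point in hex, joined by dots.
my $code_points = sprintf "%vX", $chars;

# unpack "H*" expects bytes. A character above 0xFF is wrapped to its low
# 8 bits (0x202F becomes 0x2F) and Perl warns about it; the warning is
# disabled here only to keep the demonstration quiet.
my $wrapped = do {
    no warnings 'unpack';
    unpack "H*", $chars;
};

say $code_points;    # 31.34.202F.41.4D
say $wrapped;        # 31342f414d
```

The second line has silently lost information: 2f is the code for "/", which is why the first hex dump earlier in the thread looks as if the filename contained a slash.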
      

        Wait, SORRY! You are right! Looks like I added a stray semicolon to the name of the file in $blah, so the script was still failing. Holy crap I'm an idiot. Using `use utf8` does get the two hex dumps to match now. OK, now to wrap my head around all this. Jesus.

        THANK YOU!

        It makes no difference whether I use utf8 or not (and I thought use v5.36 turned utf8 on out of the box, anyway). It still fails.

        Also, same result with sprintf "%vX", just slightly different output. Try it:

        #! /usr/bin/env perl
        use v5.36;
        use utf8;
        use Encode 'decode';

        # get all the files in the current directory
        my @files = map { decode 'UTF-8', $_ } glob("*");
        my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

        my $ss = $files[0];
        my $hex = sprintf "%vX", $ss;
        say $hex;
        say $file; # ERROR!

        my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
        my $hex2 = sprintf "%vX", $blah;
        say $hex2;
        say $hex eq $hex2 ? "hexes equal" : "hexes not equal";
        say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

        OUTPUTS:

        53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
        Wide character in say at ./test.pl line 16.
        Screenshot-2024-02-23-at-1.05.14 AM.png
        53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.26.23.38.32.33.39.3B.41.4D.2E.70.6E.67
        hexes not equal

      Why do you use literals in your source code?

      You have the UTF-8 in your source code:

      my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";

      ... but you never tell Perl that your source code should be seen as utf8.

      Don't use UTF-8 in your source code unless you also tell Perl about it.

      Also, you will have noted already that your regular expression matches on the decoded filename.
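A hedged sketch of what that means for the literal: without use utf8, Perl treats each byte of the UTF-8 source as its own character. To avoid depending on how this page renders the special space, the example simulates the "no use utf8" reading by encoding the string explicitly. It assumes the source file is saved as UTF-8, and the variable names are invented for the illustration.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw( encode );

# What the literal becomes WITH "use utf8": one character, U+202F.
my $with_utf8 = "14\x{202F}AM";

# What the same literal becomes WITHOUT "use utf8": the three UTF-8 bytes
# from the source file (E2 80 AF) are read as three separate characters.
# encode() is used here to simulate that reading.
my $without_utf8 = encode( 'UTF-8', $with_utf8 );

say length $with_utf8;              # 5
say length $without_utf8;           # 7
say sprintf "%vX", $with_utf8;      # 31.34.202F.41.4D
say sprintf "%vX", $without_utf8;   # 31.34.E2.80.AF.41.4D
```

The two strings look identical when printed to a UTF-8 terminal, but they do not compare equal, and only the first one has a character that \s can match.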

        PerlMonks website added them in. I just forgot to delete them. But they are narrow non-breaking space characters.
