in reply to Re^8: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
in thread Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

Thanks, but I'm not sure I follow. I tried applying decode to file names:

#! /usr/bin/env perl

use v5.36;
use Encode 'decode';

# get all the files in the current directory
my @files = map { decode 'UTF-8', $_ } glob("*");

my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

my $ss = $files[0];
my $hex = unpack("H*", $ss);
say $hex;

say $file; # ERROR!

my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
my $hex2 = unpack("H*", $blah);
say $hex2;

say $hex eq $hex2 ? "hexes equal" : "hexes not equal";

say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

OUTPUTS:

Character in 'H' format wrapped in unpack at ./test.pl line 12.
53637265656e73686f742d323032342d30322d32332d61742d312e30352e31342f414d2e706e67
Wide character in say at ./test.pl line 15.
Screenshot-2024-02-23-at-1.05.14 AM.png
53637265656e73686f742d323032342d30322d32332d61742d312e30352e3134e280af414d2e706e67
hexes not equal

$PM = "Perl Monk's";
$MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
$nysus = $PM . ' ' . $MC;
Click here if you love Perl Monks

Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by ikegami (Patriarch) on Aug 13, 2024 at 17:43 UTC

    And we're back to the original question (answered here): You're missing use utf8;.

    Also, you should be using sprintf "%vX", $_ instead of unpack "H*", $_. The former handles any string. The latter only handles strings of bytes (strings where no character is higher than 0xFF), so it's definitely inappropriate here.
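
To illustrate the difference between the two kinds of dump, here is a minimal sketch. It is not part of the original post; the short string and the variable names are made up for the illustration. It dumps a decoded string containing U+202F three ways: with "%vX", with unpack "H*" after encoding to UTF-8, and with unpack "H*" applied directly to the character string (the case that produced the "wrapped" warning earlier in the thread).

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# A decoded (character) string containing U+202F NARROW NO-BREAK SPACE.
# Shortened, illustrative stand-in for the file name in the thread.
my $chars = "14\x{202F}AM";

# "%vX" prints the code point of every character, whatever its size.
my $dotted = sprintf "%vX", $chars;                 # 31.34.202F.41.4D

# unpack "H*" is meant for bytes, so encode first to see the UTF-8 bytes.
my $bytes_hex = unpack "H*", encode_utf8($chars);   # 3134e280af414d

# Applied directly to the character string, unpack "H*" warns and keeps only
# the low byte of U+202F. The warning is collected here instead of printed.
my @warnings;
my $wrapped_hex = do {
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    unpack "H*", $chars;                            # 31342f414d
};

print "$dotted\n";
print "$bytes_hex\n";
print "$wrapped_hex\n";
print scalar(@warnings), " warning(s) captured\n";
```

The third dump is the misleading one: the `2f` looks like a slash, which is what showed up in the first hex dump in the question.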

    #!/usr/bin/perl
    
    use v5.36;
    use warnings;
    
    # Source code encoded using UTF-8.
    use utf8;
    
    # Terminal provides/expects UTF-8 (for `say`).
    use open ":std", ":encoding(UTF-8)"; 
    
    use Encode qw( decode_utf8 );
    
    my $base = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
    my $lit  = "Screenshot-2024-02-23-at-1.05.14 AM.png";
    
    my @files = map { decode_utf8 $_ } glob( "*" );
    my ( $file ) = grep { /^Screenshot-2024-02-23-at-1.05.14\s/ } @files;
    
    my $base_hex = sprintf "%vX", $base;
    
    for ( $base, $lit, $file ) {
       say $_;
       say $_ eq $base ? "same" : "different";
    
       my $hex = sprintf "%vX", $_;
       say $hex;
       say $hex eq $base_hex ? "same" : "different";
    }
    

    Output:

    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
    

      Wait, SORRY! You are right! Looks like I added a stray semicolon to the name of the file in $blah, so the script was still failing. Holy crap, I'm an idiot. Using `utf8` does get the two hex dumps to match now. OK, now to wrap my head around all this. Jesus.

      THANK YOU!

      $PM = "Perl Monk's";
      $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
      $nysus = $PM . ' ' . $MC;
      Click here if you love Perl Monks

      It makes no difference if I use utf8 or not (and I thought use v5.36 set utf8 out of the box, anyway). It still fails.

      Also, same result with sprintf "%vX", just slightly different output. Try it:

      #! /usr/bin/env perl

      use v5.36;
      use utf8;
      use Encode 'decode';

      # get all the files in the current directory
      my @files = map { decode 'UTF-8', $_ } glob("*");

      my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

      my $ss = $files[0];
      my $hex = sprintf "%vX", $ss;
      say $hex;

      say $file; # ERROR!

      my $blah = "Screenshot-2024-02-23-at-1.05.14&#8239;AM.png";
      my $hex2 = sprintf "%vX", $blah;
      say $hex2;

      say $hex eq $hex2 ? "hexes equal" : "hexes not equal";

      say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

      OUTPUTS:

      53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
      Wide character in say at ./test.pl line 16.
      Screenshot-2024-02-23-at-1.05.14 AM.png
      53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.26.23.38.32.33.39.3B.41.4D.2E.70.6E.67
      hexes not equal

      $PM = "Perl Monk's";
      $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
      $nysus = $PM . ' ' . $MC;
      Click here if you love Perl Monks

        You used the 7-character string &#8239; (26.23.38.32.33.39.3B) instead of a NNBSP (202F) in your program X_X.

        I added a program to my previous post.
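
As a rough illustration of the point above (not from the original post), the sketch below dumps a short string containing the seven-character numeric reference and one containing a real U+202F, then converts the reference back into a character. The strings, variable names, and the substitution are assumptions made for the example; the substitution only handles decimal references of the form `&#NNNN;`.

```perl
use strict;
use warnings;

# What ends up in the program when the entity text is pasted as-is,
# versus a real NARROW NO-BREAK SPACE written as an escape.
my $entity = "14&#8239;AM";
my $nnbsp  = "14\x{202F}AM";

my $entity_hex = sprintf "%vX", $entity;   # 31.34.26.23.38.32.33.39.3B.41.4D
my $nnbsp_hex  = sprintf "%vX", $nnbsp;    # 31.34.202F.41.4D

# Hypothetical repair: turn decimal character references back into characters.
( my $repaired = $entity ) =~ s/&#(\d+);/chr $1/ge;

print "$entity_hex\n";
print "$nnbsp_hex\n";
print $repaired eq $nnbsp ? "repaired string matches\n" : "still different\n";
```

Writing the character as `\x{202F}` in the source avoids the problem entirely, since the escape survives copying through a web page.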

Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?
by Corion (Patriarch) on Aug 13, 2024 at 17:39 UTC

    Why do you use literals in your source code?

    You have the UTF-8 in your source code:

    my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";

    ... but you never tell Perl that your source code should be seen as utf8.

    Don't use UTF-8 in your source code unless you also tell Perl about it.

    Also, you will have noted already that your regular expression matches on the decoded filename.

      PerlMonks website added them in. I just forgot to delete them. But they are narrow non-breaking space characters.

      $PM = "Perl Monk's";
      $MC = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar Parson";
      $nysus = $PM . ' ' . $MC;
      Click here if you love Perl Monks

        Yes, but they are what I mean by "literals". If you have any byte above 127 in your source code, you need to tell Perl what encoding your source code is in if it is not Latin-1.

        You have something that is UTF-8, but you are not telling Perl that your source code contains UTF-8.
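
A small sketch of what this means in practice, under stated assumptions: the two strings below use escapes to stand in for what Perl sees when a UTF-8 encoded literal is compiled without and with `use utf8`. The escapes are a simulation so the example does not depend on how this page is saved; the shortened file name and variable names are made up.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Without `use utf8`, a UTF-8 NNBSP in a literal is seen as three bytes.
my $bytes = "at-1.05.14\xE2\x80\xAFAM.png";

# With `use utf8`, the same literal is seen as one character, U+202F.
my $chars = "at-1.05.14\x{202F}AM.png";

my $length_diff   = length($bytes) - length($chars);       # 2
my $bytes_match   = $bytes =~ /14\s/ ? 1 : 0;              # 0: no byte is whitespace
my $chars_match   = $chars =~ /14\s/ ? 1 : 0;              # 1: U+202F is whitespace
my $decoded_equal = decode( 'UTF-8', $bytes ) eq $chars ? 1 : 0;   # 1

print "length difference: $length_diff\n";
print "bytes match \\s:    $bytes_match\n";
print "chars match \\s:    $chars_match\n";
print "decoded eq chars:  $decoded_equal\n";
```

This is why a decoded file name from glob can only compare equal to a literal in the source when that literal has been decoded too, which is what `use utf8` does for UTF-8 source code.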