Re^9: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

Thanks, but I'm not sure I follow. I tried applying decode to file names:


#! /usr/bin/env perl

use v5.36;

use Encode 'decode';

# get all the files in the current directory
my @files = map { decode 'UTF-8', $_ } glob("*");
my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

my $ss = $files[0];
my $hex = unpack("H*", $ss);
say $hex;

say $file; # ERROR!

my $blah = "Screenshot-2024-02-23-at-1.05.14&#8239;AM.png";
my $hex2 = unpack("H*", $blah);
say $hex2;

say $hex eq $hex2 ? "hexes equal" : "hexes not equal";

say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/;  # WORKS!

OUTPUTS:

Character in 'H' format wrapped in unpack at ./test.pl line 12.
53637265656e73686f742d323032342d30322d32332d61742d312e30352e31342f414d2e706e67
Wide character in say at ./test.pl line 15.
Screenshot-2024-02-23-at-1.05.14&#8239;AM.png
53637265656e73686f742d323032342d30322d32332d61742d312e30352e3134e280af414d2e706e67
hexes not equal

$PM = "Perl Monk's";
$MC = "Most Clueless ~~Friar~~ ~~Abbot~~ ~~Bishop~~ ~~Pontiff~~ ~~Deacon~~ ~~Curate~~ ~~Priest~~ ~~Vicar~~ Parson";
$nysus = $PM . ' ' . $MC;
Click here if you love Perl Monks


Replies are listed 'Best First'.
Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 17:43 UTC
And we're back to the original question (answered here): You're missing `use utf8;`.

Also, you should be using `sprintf "%vX", $_` instead of `unpack "H*", $_`. The former handles any string. The latter only handles strings of bytes (strings where no character is higher than 0xFF), so it's definitely inappropriate here.

#!/usr/bin/perl

use v5.36;
use warnings;

# Source code encoded using UTF-8.
use utf8;

# Terminal provides/expects UTF-8 (for `say`).
use open ":std", ":encoding(UTF-8)";

use Encode qw( decode_utf8 );

my $base = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
my $lit  = "Screenshot-2024-02-23-at-1.05.14 AM.png";  # literal U+202F before "AM"

my @files = map { decode_utf8 $_ } glob( "*" );
my ( $file ) = grep { /^Screenshot-2024-02-23-at-1.05.14\s/ } @files;

my $base_hex = sprintf "%vX", $base;

for ( $base, $lit, $file ) {
   say $_;
   say $_ eq $base ? "same" : "different";
   my $hex = sprintf "%vX", $_;
   say $hex;
   say $hex eq $base_hex ? "same" : "different";
}

Output:

Screenshot-2024-02-23-at-1.05.14 AM.png
same
53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
same
Screenshot-2024-02-23-at-1.05.14 AM.png
same
53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
same
Screenshot-2024-02-23-at-1.05.14 AM.png
same
53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
same
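To make the difference between the two kinds of dump concrete, here is a minimal sketch (not part of the program above). It assumes a shortened, hypothetical name, `1.05.14<NNBSP>AM`, written with a `\x{202F}` escape so the result doesn't depend on how the script file is saved. `$chars` stands in for decoded text and `$bytes` for what the file system would typically hand back on a UTF-8 system.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical, shortened name; U+202F is NARROW NO-BREAK SPACE.
my $chars = "1.05.14\x{202F}AM";     # decoded text: one element per code point
my $bytes = encode_utf8($chars);     # the same text as UTF-8 bytes

# "%vX" joins the ordinal of every character with dots, so it works
# for decoded text and for byte strings alike.
my $chars_dump = sprintf "%vX", $chars;
my $bytes_dump = sprintf "%vX", $bytes;

# unpack "H*" is only meaningful for byte strings, so it is applied
# to $bytes only.
my $bytes_hex = unpack "H*", $bytes;

print "$chars_dump\n";   # 31.2E.30.35.2E.31.34.202F.41.4D
print "$bytes_dump\n";   # 31.2E.30.35.2E.31.34.E2.80.AF.41.4D
print "$bytes_hex\n";    # 312e30352e3134e280af414d
```

The decoded string shows a single `202F` element, while the byte string shows the three UTF-8 bytes `E2.80.AF`. A dump of a decoded file name and a dump of an undecoded literal therefore can't be expected to match.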
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:06 UTC
Wait, SORRY! You are right! Looks like I added a stray semicolon to the name of the file in $blah, so the script was still failing. Holy crap I'm an idiot. Using `utf8` does get the two hex dumps to match now. Ok, now to wrap my head around all this. Jesus. THANK YOU!
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:55 UTC
It makes no difference whether I use utf8 or not (and I thought `use v5.36` set utf8 out of the box, anyway). It still fails. Also, same result with sprintf "%vX", just slightly different output. Try it:

#! /usr/bin/env perl

use v5.36;
use utf8;

use Encode 'decode';

# get all the files in the current directory
my @files = map { decode 'UTF-8', $_ } glob("*");
my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;

my $ss = $files[0];
my $hex = sprintf "%vX", $ss;
say $hex;

say $file; # ERROR!

my $blah = "Screenshot-2024-02-23-at-1.05.14&#8239;AM.png";
my $hex2 = sprintf "%vX", $blah;
say $hex2;

say $hex eq $hex2 ? "hexes equal" : "hexes not equal";

say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/;  # WORKS!

OUTPUTS:

53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
Wide character in say at ./test.pl line 16.
Screenshot-2024-02-23-at-1.05.14 AM.png
53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.26.23.38.32.33.39.3B.41.4D.2E.70.6E.67
hexes not equal
Re^12: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 18:02 UTC
You used the 7-character string `&#8239;` (26.23.38.32.33.39.3B) instead of a NNBSP (202F) in your program X_X. I added a program to my previous post.
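A small sketch of that diagnosis, assuming nothing beyond core Perl: the HTML entity text and the actual character are different strings, and `%vX` makes the difference visible. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# The HTML entity as plain text, as it would look if pasted from a web page.
my $entity = '&#8239;';

# The character the entity stands for: U+202F NARROW NO-BREAK SPACE.
my $nnbsp = "\x{202F}";

my $entity_len  = length $entity;            # 7 characters
my $nnbsp_len   = length $nnbsp;             # 1 character
my $entity_dump = sprintf "%vX", $entity;    # 26.23.38.32.33.39.3B
my $nnbsp_dump  = sprintf "%vX", $nnbsp;     # 202F

print "$entity_len $entity_dump\n";
print "$nnbsp_len $nnbsp_dump\n";
```

Writing the character as `\x{202F}` in the source sidesteps the copy-and-paste problem entirely, since no non-ASCII character (or entity) ever has to survive the trip through a browser.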
Re^13: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:23 UTC
Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by Corion (Patriarch) on Aug 13, 2024 at 17:39 UTC
Why do you use literals in your source code? You have the UTF-8 in your source code:

my $blah = "Screenshot-2024-02-23-at-1.05.14&#8239;AM.png";

... but you never tell Perl that your source code should be seen as utf8. Don't use UTF-8 in your source code unless you also tell Perl about it.

Also, you will have noted already that your regular expression matches on the `decode`d filename.
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:49 UTC
PerlMonks website added them in. I just forgot to delete them. But they are narrow non-breaking space characters.
Re^12: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by Corion (Patriarch) on Aug 13, 2024 at 17:52 UTC
Yes, but they are what I mean by "literals". If you have any byte above 127 in your source code, you need to tell Perl what encoding your source code is in if it is not Latin-1. You have something that is UTF-8, but you are not telling Perl that your source code contains UTF-8.
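The same bytes-versus-characters distinction decides whether `\s` can match. The sketch below is an illustration only: `$raw_name` is a hypothetical stand-in for a name as `glob` would return it on a UTF-8 file system, built here with `encode_utf8` so the example needs no real file and no non-ASCII source text.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 encode_utf8 );

# Stand-in for a file name as returned by glob: UTF-8 encoded bytes.
my $raw_name = encode_utf8("Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png");

# Decoding turns the three bytes E2 80 AF into the single character U+202F.
my $name = decode_utf8($raw_name);

# \s matches exactly one whitespace character before "AM".
my $re = qr/^Screenshot-2024-02-23-at-1\.05\.14\sAM\.png\z/;

my $raw_matches     = $raw_name =~ $re ? 1 : 0;   # 0: three bytes, none whitespace
my $decoded_matches = $name     =~ $re ? 1 : 0;   # 1: U+202F is whitespace

print "raw: $raw_matches, decoded: $decoded_matches\n";   # raw: 0, decoded: 1
```

A string literal containing UTF-8 without `use utf8;` behaves like `$raw_name` here, while one compiled under `use utf8;` behaves like `$name`.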
Re^13: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:21 UTC