Re^2: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

Hmmm. Thanks. Maybe it's not the character I think it is:

> $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14&#8239;AM.png/ =~ /\s/ and say 'OK' or die"
Died at -e line 1.

The #8239 popped in after submitting this post. It's not actually in the code.

#8239 is 202F in hex. I don't get this.

This doesn't even work:

> $ perl -wMstrict -lE "qq/Screenshot-2024-02-23-at-1.05.14&#8239;AM.png/ =~ /\x{202F}/ and say 'OK' or die"
Died at -e line 1.

$PM = "Perl Monk's";
$MC = "Most Clueless ~~Friar~~ ~~Abbot~~ ~~Bishop~~ ~~Pontiff~~ ~~Deacon~~ ~~Curate~~ ~~Priest~~ ~~Vicar~~ Parson";
$nysus = $PM . ' ' . $MC;


Re^3: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 16:02 UTC
Even though you entered a NNBSP on the command line, your program doesn't contain a NNBSP. (And I'm not referring to the appearance of `&#8239;`. That's due to a PerlMonks limitation.) By default, Perl programs are expected to be encoded using ASCII. NNBSP isn't found in the ASCII character set, so your program can't possibly include a NNBSP. Assuming a UTF-8 terminal, what you actually provided Perl is equivalent to `"...\xE2\x80\xAF..."`. But a string containing a NNBSP would be `"...\x{202F}..."`. You can tell Perl that the program is encoded using UTF-8 by adding `use utf8;`.
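To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It assumes the core `Encode` module and writes the three UTF-8 bytes out explicitly, so the result does not depend on terminal or source-file encoding. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What perl sees without "use utf8": three separate bytes, not one NNBSP.
my $bytes = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";

# The same name decoded into characters: a single U+202F.
my $chars = decode('UTF-8', $bytes);

for my $pair (['bytes', $bytes], ['chars', $chars]) {
    my ($label, $str) = @$pair;
    printf "%s: length=%d  \\s=%s  \\x{202F}=%s\n",
        $label, length($str),
        ($str =~ /\s/       ? 'match' : 'no match'),
        ($str =~ /\x{202F}/ ? 'match' : 'no match');
}
```

The byte string should fail both tests, while the decoded string should pass both, since U+202F is a Unicode whitespace character.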
Re^4: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 16:16 UTC
OK, thank you! I'm getting closer but still confused AF. So this returns files as desired:

> use utf8;
> my $image_name = 'Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF';
> my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

But this still does not match:

> use utf8;
> my $image_name = 'Screenshot-2024-02-23-at-1.05.14\s';
> my $files = $wac->get_all_files_in_dir($dir . '/uploads', qr/$image_name/);

I'm using neovim. It shows the file is also encoded as UTF-8.
Re^5: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 16:28 UTC
`use utf8;` has no effect in that program since it's encoded using ASCII. But it doesn't hurt since ASCII is a subset of UTF-8.

The issue is that `get_all_files_in_dir` is matching against the still-encoded file names. The second program is effectively doing

> my $fn = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAF";
> $fn =~ /Screenshot-2024-02-23-at-1.05.14\s/

That will only match if U+E2 is a space character, and it isn't.
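One way around the still-encoded names is to decode them before matching. The sketch below is only an illustration: `grep_decoded_names` is a hypothetical helper, not part of whatever module provides `get_all_files_in_dir`, and it assumes the file names are valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw UTF-8 file names into character strings,
# then return the decoded names that match the given regex.
sub grep_decoded_names {
    my ($re, @raw_names) = @_;
    return grep { $_ =~ $re } map { decode('UTF-8', $_) } @raw_names;
}

# Raw names as a directory read would return them (bytes, not characters).
my @raw = (
    "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM-1024x698.png",
    "Screenshot-2024-02-23-at-1.05.14-AM.png",
);

my @hits = grep_decoded_names(qr/Screenshot-2024-02-23-at-1\.05\.14\s/, @raw);
printf "%d match(es)\n", scalar @hits;
```

Only the first name should match, because after decoding its three bytes become a single U+202F, which `\s` accepts. The returned names are character strings, so they would need encoding again before being passed back to file system calls.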
Re^6: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 16:39 UTC
Re^7: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 17:39 UTC
Re^7: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:05 UTC
Some notes below your chosen depth have not been shown here
Re^3: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by haukex (Archbishop) on Aug 13, 2024 at 16:00 UTC
> The #8239 popped in after submitting this post. It's not actually in the code.

Yeah, PerlMonks does that to Unicode characters in `<code>` blocks - see my node here. In that case, I would suspect an encoding error - see ikegami's reply and my node here.
Re^4: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 16:03 UTC
I copied and pasted the file name into a file and did a hex dump:

>           00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF
> 00000000  53 63 72 65 65 6E 73 68 - 6F 74 2D 32 30 32 34 2D  Screenshot-2024-
> 00000010  30 32 2D 32 33 2D 61 74 - 2D 31 2E 30 35 2E 31 34  02-23-at-1.05.14
> 00000020  E2 80 AF 41 4D 2D 31 30 - 32 34 78 36 39 38 2E 70  ...AM-1024x698.p
> 00000030  6E 67 0A                                           ng.

E2 80 AF is the UTF-8. I wonder if the act of cutting and pasting is modifying the string. I'm using tmux.
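The bytes in the dump can be cross-checked against the code point from Perl itself. This is a small sketch assuming the core `Encode` module: it confirms that decimal 8239 is hex 202F, and that U+202F encodes to the three bytes seen at offset 0x20.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Decimal 8239 (from the &#8239; entity) and hex 202F are the same code point.
printf "8239 decimal = %X hex\n", 8239;

# Encode the single character U+202F as UTF-8 and list its bytes in hex.
my $utf8_bytes = encode('UTF-8', "\x{202F}");
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
print "U+202F as UTF-8: $hex\n";
```

If the second line prints `E2 80 AF`, the pasted name contains a correctly encoded NNBSP, which suggests the copy and paste left the string intact.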