Re^8: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl?

Replies are listed 'Best First'.
Re^9: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:09 UTC
Corion, it looks like you are right after all. I also need to change how the hex codes are printed and use `sprintf "%vX"`. Wow, this is all so confusing. But I'm learning (hopefully). Thank you!

$PM = "Perl Monk's"; $MC = "Most Clueless ~~Friar~~ ~~Abbot~~ ~~Bishop~~ ~~Pontiff~~ ~~Deacon~~ ~~Curate~~ ~~Priest~~ ~~Vicar~~ Parson"; $nysus = $PM . ' ' . $MC;
Re^9: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:33 UTC
Thanks, but I'm not sure I follow. I tried applying decode to file names:

    #! /usr/bin/env perl
    use v5.36;
    use Encode 'decode';

    # get all the files in the current directory
    my @files = map { decode 'UTF-8', $_ } glob("*");
    my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
    my $ss = $files[0];
    my $hex = unpack("H*", $ss);
    say $hex;
    say $file; # ERROR!
    my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
    my $hex2 = unpack("H*", $blah);
    say $hex2;
    say $hex eq $hex2 ? "hexes equal" : "hexes not equal";
    say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

OUTPUTS:

    Character in 'H' format wrapped in unpack at ./test.pl line 12.
    53637265656e73686f742d323032342d30322d32332d61742d312e30352e31342f414d2e706e67
    Wide character in say at ./test.pl line 15.
    Screenshot-2024-02-23-at-1.05.14 AM.png
    53637265656e73686f742d323032342d30322d32332d61742d312e30352e3134e280af414d2e706e67
    hexes not equal
Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 17:43 UTC
And we're back to the original question (answered here): You're missing `use utf8;`.

Also, you should be using `sprintf "%vX", $_` instead of `unpack "H*", $_`. The former handles any string. The latter only handles strings of bytes (strings in which no character is higher than 0xFF), so it's definitely inappropriate here.

    #!/usr/bin/perl

    use v5.36;
    use warnings;

    # Source code encoded using UTF-8.
    use utf8;

    # Terminal provides/expects UTF-8 (for `say`).
    use open ":std", ":encoding(UTF-8)";

    use Encode qw( decode_utf8 );

    my $base = "Screenshot-2024-02-23-at-1.05.14\x{202F}AM.png";
    my $lit  = "Screenshot-2024-02-23-at-1.05.14 AM.png";

    # File names come back from glob as bytes, so decode them.
    my @files = map { decode_utf8 $_ } glob( "*" );
    my ( $file ) = grep { /^Screenshot-2024-02-23-at-1.05.14\s/ } @files;

    my $base_hex = sprintf "%vX", $base;

    for ( $base, $lit, $file ) {
       say $_;
       say $_ eq $base ? "same" : "different";
       my $hex = sprintf "%vX", $_;
       say $hex;
       say $hex eq $base_hex ? "same" : "different";
    }

Output:

    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
    Screenshot-2024-02-23-at-1.05.14 AM.png
    same
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    same
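To see the difference between the two dump styles in isolation, here is a minimal sketch. The helper names `hex_of_chars` and `hex_of_utf8_bytes` and the shortened sample string are made up for illustration; they are not from the scripts in this thread. The second helper encodes to UTF-8 before calling `unpack`, which is one way (an assumption about what you want to see, not the only option) to get a byte dump without the "wrapped" warning.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Dump the code points of a string of characters (any code point is fine).
sub hex_of_chars {
    my ($str) = @_;
    return sprintf "%vX", $str;
}

# Dump the UTF-8 bytes of the same string. Encoding first guarantees that
# unpack only ever sees values in 0x00..0xFF, so nothing gets wrapped.
sub hex_of_utf8_bytes {
    my ($str) = @_;
    return unpack "H*", encode( "UTF-8", $str );
}

# Hypothetical shortened fragment of the file name: "14", U+202F, "AM".
my $sample = "14\x{202F}AM";

print hex_of_chars($sample),      "\n";    # 31.34.202F.41.4D
print hex_of_utf8_bytes($sample), "\n";    # 3134e280af414d
```

The first line shows one entry per character, so U+202F is visible as `202F`. The second shows the three bytes `e2 80 af` that the same character occupies on disk.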
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:06 UTC
Wait, SORRY! You are right! Looks like I added a stray semicolon to the name of the file in $blah and so the script was still failing. Holy crap I'm an idiot. Using `utf8` does get the two hex dumps to match now. Ok, now to wrap my head around all this. Jesus. THANK YOU!
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:55 UTC
It makes no difference whether I use utf8 or not (and I thought `use v5.36` set utf8 out of the box, anyway). It still fails. Also, same result with sprintf "%vX", just slightly different output. Try it:

    #! /usr/bin/env perl
    use v5.36;
    use utf8;
    use Encode 'decode';

    # get all the files in the current directory
    my @files = map { decode 'UTF-8', $_ } glob("*");
    my ($file) = grep { /Screenshot-2024-02-23-at-1.05.14\s/ } @files;
    my $ss = $files[0];
    my $hex = sprintf "%vX", $ss;
    say $hex;
    say $file; # ERROR!
    my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";
    my $hex2 = sprintf "%vX", $blah;
    say $hex2;
    say $hex eq $hex2 ? "hexes equal" : "hexes not equal";
    say $blah =~ /Screenshot-2024-02-23-at-1.05.14\s/; # WORKS!

OUTPUTS:

    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.202F.41.4D.2E.70.6E.67
    Wide character in say at ./test.pl line 16.
    Screenshot-2024-02-23-at-1.05.14 AM.png
    53.63.72.65.65.6E.73.68.6F.74.2D.32.30.32.34.2D.30.32.2D.32.33.2D.61.74.2D.31.2E.30.35.2E.31.34.26.23.38.32.33.39.3B.41.4D.2E.70.6E.67
    hexes not equal
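One way to read a `%vX` dump like the second one above is to turn the hex code points back into characters. The sketch below does that for the seven entries that appear where `202F` was expected. `chars_from_vhex` is a hypothetical helper name, not something from the posted script.

```perl
use strict;
use warnings;

# Turn a "%vX"-style dump (hex code points joined by dots) back into a string.
sub chars_from_vhex {
    my ($dump) = @_;
    return join "", map { chr( hex $_ ) } split /\./, $dump;
}

# The seven code points that take the place of 202F in the second dump.
my $odd = chars_from_vhex("26.23.38.32.33.39.3B");
print "$odd\n";    # &#8239;
```

Those seven characters spell `&#8239;`, the HTML numeric character reference for U+202F (8239 decimal is 0x202F), so the literal being dumped holds the escaped text rather than the character itself.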
Re^12: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by ikegami (Patriarch) on Aug 13, 2024 at 18:02 UTC
Re^13: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:23 UTC
Re^10: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by Corion (Patriarch) on Aug 13, 2024 at 17:39 UTC
Why do you use literals in your source code? You have UTF-8 in your source code:

`my $blah = "Screenshot-2024-02-23-at-1.05.14 AM.png";`

... but you never tell Perl that your source code should be seen as utf8. Don't use UTF-8 in your source code unless you also tell Perl about it.

Also, you will have noted already that your regular expression matches on the `decode`d filename.
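The point about the regular expression matching only the decoded name can be sketched without touching the file system. This assumes the file name reaches Perl as raw UTF-8 bytes, so U+202F arrives as the three bytes E2 80 AF; `space_after_time` is a hypothetical helper, and the literal is written with `\x` escapes so the example does not depend on `use utf8`.

```perl
use strict;
use warnings;
use Encode qw( decode );

# True (1) if the name has a whitespace character right after the time stamp.
sub space_after_time {
    my ($name) = @_;
    return $name =~ /1\.05\.14\s/ ? 1 : 0;
}

# Assumption: this is what glob hands back, i.e. undecoded UTF-8 bytes.
my $raw     = "Screenshot-2024-02-23-at-1.05.14\xE2\x80\xAFAM.png";
my $decoded = decode( "UTF-8", $raw );

for my $name ( $raw, $decoded ) {
    my $verdict = space_after_time($name) ? "match" : "no match";
    print "$verdict\n";
}
# Prints "no match" for the raw bytes, then "match" for the decoded string.
```

In the raw string `\s` is tested against the byte E2, which is not whitespace. After decoding, the same position holds the single character U+202F, which `\s` does match.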
Re^11: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 17:49 UTC
PerlMonks website added them in. I just forgot to delete them. But they are narrow non-breaking space characters.
Re^12: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by Corion (Patriarch) on Aug 13, 2024 at 17:52 UTC
Re^13: Any good ways to handle NARROW NO-BREAK SPACE characters in regex in newer versions of Perl? by nysus (Parson) on Aug 13, 2024 at 18:21 UTC