Dear Monks,
Could someone please explain the following output?
#!/usr/bin/perl -w
use strict;
use Encode;
use Data::Dumper;

my $string1 = "blëh";
my $string2 = "blëhh";
my $string3 = "blëh.txt";

$string1 = Encode::decode(utf8 => $string1);
$string2 = Encode::decode(utf8 => $string2);
$string3 = Encode::decode(utf8 => $string3);

$Data::Dumper::Useqq = 1;
print Dumper $string1, $string2, $string3;

print "matches1\n" if ($string1 =~ /^[\w\s.]+$/);
print "matches2\n" if ($string2 =~ /^[\p{Word}]+$/);
print "matches3\n" if ($string3 =~ /^[\p{L}\p{M}\p{N}.]+$/);

##### output #####
$VAR1 = "bl";
$VAR2 = "bl\x{fffd}hh";
$VAR3 = "bl\x{fffd}h.txt";
matches1
(Perl version is 5.8.4)
As far as I can see, $string1 should not match but does (note the odd Dumper output, which is missing everything after "bl"), while $string2 and $string3 should match but don't.
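One way to narrow this down is to check which bytes the ë actually occupies in the script file. The following is only a minimal sketch, assuming the file might have been saved as Latin-1 rather than UTF-8 (a guess that the \x{fffd} in the output hints at but does not prove). The helper name `describe` and the two hard-coded byte strings are made up for illustration, and it is written against a recent Perl; 5.8.4 may treat the malformed byte differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string as UTF-8 and report the raw
# bytes, the resulting code points, and whether the result matches \w+.
sub describe {
    my ($bytes) = @_;
    my $chars   = decode('UTF-8', $bytes);   # default CHECK: malformed input is replaced
    my $hex     = join ' ', map { sprintf '%02X',   ord($_) } split //, $bytes;
    my $cps     = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
    my $verdict = $chars =~ /^\w+$/ ? 'matches' : 'no match';
    return "$hex -> $cps : $verdict";
}

# The same word written as bytes in two encodings (assumed test data):
#   "bl\xEBhh"      has the e-diaeresis as the single Latin-1 byte EB
#   "bl\xC3\xABhh"  has it as the two-byte UTF-8 sequence C3 AB
print describe($_), "\n" for "bl\xEBhh", "bl\xC3\xABhh";
```

If the Latin-1 line shows U+FFFD where the ë should be and the UTF-8 line shows U+00EB, then the regexes are probably fine and the question becomes how the source file was saved.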
In reply to utf weirdness in regex by december