I am trying to come up with a regular expression that will match a string, allowing regular letters, hyphens, Unicode letters, numbers, spaces, and newlines (\n or \r\n), but no punctuation of any sort.
use charnames qw( :full );
my $s = "ksi\N{LATIN SMALL LETTER E WITH OGONEK}"
      . "gowo\N{LATIN SMALL LETTER S WITH ACUTE}"
      . "\N{LATIN SMALL LETTER C WITH ACUTE}";
print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";
match
What does \X have to do with that? Is it that the string is (at least partially) decomposed?
use charnames qw( :full );
my $s = "ksie\N{COMBINING OGONEK}gowo"
      . "s\N{COMBINING ACUTE ACCENT}"
      . "c\N{COMBINING ACUTE ACCENT}";
print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";
match
But that also matches. (OK, that surprised me.)
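If you would rather not depend on what \p{Alnum} happens to include, one option is to normalize to NFC first and spell out the allowed categories yourself. This is only a sketch, assuming Unicode::Normalize is available. The helper name is_clean is hypothetical, and picking \p{L}, \p{M} and \p{Nd} is my assumption about what "letters and numbers" should cover here:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# Hypothetical helper (not from the original post): returns 1 if $text
# holds only letters, combining marks, decimal digits, spaces, hyphens
# and newlines (\n or \r\n), and 0 otherwise.
sub is_clean {
    my ($text) = @_;

    # Compose where possible, e.g. "e" + U+0328 becomes U+0119.
    my $nfc = NFC($text);

    # \p{M} is kept so that marks with no precomposed form still pass.
    return $nfc =~ /^(?:\r\n|[\p{L}\p{M}\p{Nd} \n-])*\z/ ? 1 : 0;
}

my $decomposed = "ksie\x{0328}gowos\x{0301}c\x{0301}";

print is_clean($decomposed)      ? "match\n" : "no match\n";   # match
print is_clean("ksi\x{0119}ga!") ? "match\n" : "no match\n";   # no match
```

Listing the categories makes it obvious from the pattern itself that combining marks are allowed on purpose, rather than by accident.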
On decomposed characters,
For anyone who doesn't know, some of what you perceive as a single character can actually be represented by more than one sequence of Unicode code points. Take "é", for example. It can be the single character "é" (U+00E9), or the character "e" followed by the combining acute accent character (U+0301). Here are two forms of the string provided by the OP (fully composed and fully decomposed):
use Unicode::Normalize qw( normalize );
use charnames qw( );
my $s = "ksi\x{0119}gowo\x{015B}\x{0107}";
for (qw(NFC NFD)) {
    print "$_\n";
    printf("U+%04X: %s\n", $_, charnames::viacode($_))
        for map ord, split //, normalize($_, $s);
    print("\n");
}
NFC
U+006B: LATIN SMALL LETTER K
U+0073: LATIN SMALL LETTER S
U+0069: LATIN SMALL LETTER I
U+0119: LATIN SMALL LETTER E WITH OGONEK
U+0067: LATIN SMALL LETTER G
U+006F: LATIN SMALL LETTER O
U+0077: LATIN SMALL LETTER W
U+006F: LATIN SMALL LETTER O
U+015B: LATIN SMALL LETTER S WITH ACUTE
U+0107: LATIN SMALL LETTER C WITH ACUTE

NFD
U+006B: LATIN SMALL LETTER K
U+0073: LATIN SMALL LETTER S
U+0069: LATIN SMALL LETTER I
U+0065: LATIN SMALL LETTER E
U+0328: COMBINING OGONEK
U+0067: LATIN SMALL LETTER G
U+006F: LATIN SMALL LETTER O
U+0077: LATIN SMALL LETTER W
U+006F: LATIN SMALL LETTER O
U+0073: LATIN SMALL LETTER S
U+0301: COMBINING ACUTE ACCENT
U+0063: LATIN SMALL LETTER C
U+0301: COMBINING ACUTE ACCENT
\X is used to match a "visual character". Back to our "é" example, both
"\N{LATIN SMALL LETTER E WITH ACUTE}" =~ /^\X\z/
and
"e\N{COMBINING ACUTE ACCENT}" =~ /^\X\z/
will match.
(By the way \X doesn't match everything it should. This will be fixed in 5.12.1.)
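To see what \X buys you in practice, here is a small sketch that counts code points versus visual characters in the decomposed form of the OP's string. It assumes Unicode::Normalize is installed, and the variable names are mine:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFD );

# Fully decomposed form of the OP's string.
my $nfd = NFD("ksi\x{0119}gowo\x{015B}\x{0107}");

# length() counts individual code points, combining marks included.
my $code_points = length $nfd;

# Each \X match is one visual character: a base character plus any
# combining marks that follow it.
my @clusters = $nfd =~ /\X/g;

printf "%d code points, %d visual characters\n",
    $code_points, scalar @clusters;
# prints: 13 code points, 10 visual characters
```

The three extra code points are the combining ogonek and the two combining acute accents, each of which \X folds into the letter before it.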
In reply to Re: Unicode regular expressions (Decomposed)
by ikegami
in thread Unicode regular expressions
by SilasTheMonk