I am trying to come up with a regular expression that will match against a string allowing regular letters, hyphens, Unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort.

use charnames qw( :full );

# Fully composed form of the OP's string.
my $s = "ksi\N{LATIN SMALL LETTER E WITH OGONEK}"
      . "gowo\N{LATIN SMALL LETTER S WITH ACUTE}"
      . "\N{LATIN SMALL LETTER C WITH ACUTE}";

print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";
match

What does \X have to do with that? Is it that the string is (at least partially) decomposed?

use charnames qw( :full );

# Decomposed form: base letters followed by combining marks.
my $s = "ksie\N{COMBINING OGONEK}gowo"
      . "s\N{COMBINING ACUTE ACCENT}"
      . "c\N{COMBINING ACUTE ACCENT}";

print $s =~ /^(?:\r\n|[\p{Alnum} \n-])*\z/ ? "match\n" : "no match\n";
match

But that also matches. (ok, that surprised me)


On decomposed characters,

For anyone who doesn't know, some of what you perceive as a single character can actually be represented by more than one sequence of Unicode characters. Take "é", for example. It can be encoded as the single character "é" (U+00E9) or as the character "e" followed by the combining acute accent character (U+0301). Here are two forms of the string provided by the OP (fully composed and fully decomposed):

use Unicode::Normalize qw( normalize );
use charnames qw( );

my $s = "ksi\x{0119}gowo\x{015B}\x{0107}";

# Dump the code points of the string in each normalization form.
for (qw(NFC NFD)) {
    print "$_\n";
    printf("U+%04X: %s\n", $_, charnames::viacode($_))
        for map ord, split //, normalize($_, $s);
    print("\n");
}

NFC
U+006B: LATIN SMALL LETTER K
U+0073: LATIN SMALL LETTER S
U+0069: LATIN SMALL LETTER I
U+0119: LATIN SMALL LETTER E WITH OGONEK
U+0067: LATIN SMALL LETTER G
U+006F: LATIN SMALL LETTER O
U+0077: LATIN SMALL LETTER W
U+006F: LATIN SMALL LETTER O
U+015B: LATIN SMALL LETTER S WITH ACUTE
U+0107: LATIN SMALL LETTER C WITH ACUTE

NFD
U+006B: LATIN SMALL LETTER K
U+0073: LATIN SMALL LETTER S
U+0069: LATIN SMALL LETTER I
U+0065: LATIN SMALL LETTER E
U+0328: COMBINING OGONEK
U+0067: LATIN SMALL LETTER G
U+006F: LATIN SMALL LETTER O
U+0077: LATIN SMALL LETTER W
U+006F: LATIN SMALL LETTER O
U+0073: LATIN SMALL LETTER S
U+0301: COMBINING ACUTE ACCENT
U+0063: LATIN SMALL LETTER C
U+0301: COMBINING ACUTE ACCENT
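
The two encodings are different strings as far as eq is concerned, and only compare equal once both have been normalized to the same form. A minimal sketch using "é" (the variable names are mine, purely for illustration):

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "\x{00E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{0301}";    # "e" + COMBINING ACUTE ACCENT

# Compared as-is, the strings differ: one code point versus two.
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# After normalizing both to the same form, they compare equal.
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# NFD splits the composed character into base letter + combining mark.
my $nfd_length = length NFD($composed);

print "equal as-is:     $raw_equal\n";
print "equal after NFC: $nfc_equal\n";
print "length in NFD:   $nfd_length\n";
```

So if you need to compare or validate input that may arrive in either form, normalizing first (NFC or NFD, as long as you are consistent) is one way to take the encoding out of the picture.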

\X is used to match a "visual character": roughly, a base character together with any combining marks that follow it. Back to our example, both

"\N{LATIN SMALL LETTER E WITH ACUTE}" =~ /^\X\z/
and
"e\N{COMBINING ACUTE ACCENT}" =~ /^\X\z/
will match.
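
As a rough illustration of the difference between counting code points and counting what \X matches, here is a small sketch run on the decomposed form of the OP's string. It assumes a reasonably recent perl, and the variable names are illustrative only:

```perl
use strict;
use warnings;

my $composed   = "\x{00E9}";     # single code point
my $decomposed = "e\x{0301}";    # base letter + combining mark

# Each form is exactly one "visual character" to \X.
my $composed_ok   = $composed   =~ /^\X\z/ ? 1 : 0;
my $decomposed_ok = $decomposed =~ /^\X\z/ ? 1 : 0;

# Decomposed form of the OP's string (same code points as the NFD dump).
my $word = "ksie\x{0328}gowos\x{0301}c\x{0301}";

my $code_points = length $word;            # counts every code point
my $clusters    = () = $word =~ /\X/g;     # counts matches of \X

print "composed matches \\X:   $composed_ok\n";
print "decomposed matches \\X: $decomposed_ok\n";
print "code points: $code_points\n";       # 13
print "\\X matches:  $clusters\n";         # 10
```

The three combining marks attach to the letters before them, which is why the \X count is three lower than the code point count.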

(By the way, \X doesn't match everything it should. This will be fixed in 5.12.1.)


In reply to Re: Unicode regular expressions (Decomposed) by ikegami
in thread Unicode regular expressions by SilasTheMonk
