in reply to Unicode regular expressions

I am trying to come up with a regular expression that will match against a string allowing regular letters, hyphens, unicode letters, numbers, spaces, newlines (\n or \r\n) but no punctuation of any sort.
That's fairly trivial, once you know what you want to match, and what you don't want to. You say "hyphen" but "no punctuation of any sort". But a hyphen is punctuation of some sort. And in Unicode, there are many kinds of dashes. And what do you mean by "punctuation"? Do you consider a WHITE FROWNING FACE to be punctuation? What about a SNOWMAN? As for 'letters', Unicode defines what it considers 'letters'. Does that match your idea of letters? And numbers, do you mean digits? Anything numerical? And what are "spaces" in your definition? All 20+ spaces in the Unicode standard? Probably not, because that includes all the various newlines, and you mention them explicitly.

In short, your definition of what you want to match and what you don't is too vague to do anything with. And once it's exact, writing the regexp is easy.
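
To make the questions above concrete, here is a minimal sketch that asks Perl which Unicode properties a handful of characters have. The helper name categories and the choice of sample characters are mine, purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the properties (from a small fixed list)
# that a single character matches.
sub categories {
    my ($char) = @_;
    my @hits;
    for my $prop (qw(L N P Pd S Zs White_Space)) {
        push @hits, $prop if $char =~ /\A\p{$prop}\z/;
    }
    return join ',', @hits;
}

my @samples = (
    [ 'HYPHEN-MINUS'        => "-"        ],
    [ 'EN DASH'             => "\x{2013}" ],
    [ 'WHITE FROWNING FACE' => "\x{2639}" ],
    [ 'SNOWMAN'             => "\x{2603}" ],
    [ 'POUND SIGN'          => "\x{A3}"   ],
    [ 'NO-BREAK SPACE'      => "\x{A0}"   ],
    [ 'LINE FEED'           => "\n"       ],
);

printf "%-20s %s\n", $_->[0], categories($_->[1]) for @samples;
```

The hyphen and the en dash both come out as punctuation (P, and more specifically Pd), while the frowning face, the snowman and the pound sign are symbols (S), not punctuation. So "no punctuation" expressed as \P{P} rejects the hyphen and happily accepts a SNOWMAN. The line feed has the White_Space property, which is why "spaces" and "newlines" overlap.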

Re^2: Unicode regular expressions
by SilasTheMonk (Chaplain) on Dec 09, 2009 at 20:58 UTC
    I have had to deprioritize this particular project for now, but the answers so far contain a lot of useful information and experience which I will need to study. The main point is that people are picking up on my choice of requirements. If they are vague that might be a good thing, seeing as each interpretation of my requirements might elicit more useful information. However, I can clarify. My test was really that the regular expression should accept "księgowość" but reject "£$%%^&". I was surprised at how hard this was. More generally, I was hoping the regular expression would capture "reasonable search terms". As such I would regard a Chinese sentence as valid but an emoticon character as invalid.
      Oh, you want to recognize words. You know, you don't have to leave the ASCII realm to realize that that is more tricky than just matching letters and not matching punctuation symbols. Not matching punctuation symbols means rejecting "don't" as a word.
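
      One possible reading of "word", purely as a sketch: one or more letters, optionally joined by a single apostrophe or hyphen. The name looks_like_word and the set of joiners (ASCII apostrophe, U+2019, hyphen-minus) are my assumptions, not a definition of what a word is.

```perl
use strict;
use warnings;

# Hypothetical helper: runs of letters, joined by at most one
# apostrophe (ASCII or U+2019) or hyphen-minus between runs.
sub looks_like_word {
    my ($s) = @_;
    return $s =~ /\A\p{L}+(?:['\x{2019}\-]\p{L}+)*\z/ ? 1 : 0;
}

my $polish = "ksi\x{119}gowo\x{15B}\x{107}";   # "księgowość", written with escapes
my $junk   = "\x{A3}" . '$%%^&';               # "£$%%^&"

print looks_like_word($polish),       "\n";    # accepted
print looks_like_word("don't"),       "\n";    # accepted
print looks_like_word("well-known"),  "\n";    # accepted
print looks_like_word($junk),         "\n";    # rejected
print looks_like_word("-dash-first"), "\n";    # rejected
```

      This passes the two test strings from the post above, but it still says nothing about digits, spaces, or which of the many Unicode dashes count as a hyphen.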

      As for matching Unicode letters, we have:

          "ญᴥ一ךى" =~ /^\p{L}+$/
      
      which is a sequence of (Unicode) letters, but from 5 different scripts. Do you want to match that?
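
      If mixing scripts matters to you, one way to find out which scripts a string uses is charscript from the core Unicode::UCD module. A minimal sketch follows; the helper name scripts_in is made up for this example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: sorted list of the distinct scripts in a string.
sub scripts_in {
    my ($s) = @_;
    my %seen;
    $seen{ charscript(ord $_) }++ for split //, $s;
    my @scripts = sort keys %seen;
    return @scripts;
}

# The same five letters as above, written with escapes.
my $mixed = "\x{0E0D}\x{1D25}\x{4E00}\x{05DA}\x{0649}";

print "all letters\n" if $mixed =~ /^\p{L}+$/;
print join(', ', scripts_in($mixed)), "\n";
print join(', ', scripts_in("hello")), "\n";
```

      Rejecting anything with more than one script is too blunt for real text, because digits, spaces and most punctuation are reported as Common, and combining marks as Inherited. You would probably want to ignore those two before counting.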

      And then I haven't touched the can of worms called 'combining sequences'. Many (all?) of the accented Unicode characters can also be formed by taking the base character and adding the various decorations to it. Not to mention that most combinations of a base character and decorations don't have a Unicode code point of their own, and can only be written as combining sequences.
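
      A small sketch of what that means for /^\p{L}+$/, using the core Unicode::Normalize module. The variable names are mine, and the sample characters are only examples.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{119}";     # LATIN SMALL LETTER E WITH OGONEK
my $combining   = "e\x{328}";    # "e" followed by COMBINING OGONEK
my $no_single   = "x\x{301}";    # "x" + COMBINING ACUTE ACCENT, no precomposed form

# Letters only: the combining mark is \p{Mn}, not \p{L}.
print "precomposed matches\n" if $precomposed =~ /\A\p{L}+\z/;
print "combining fails\n"     if $combining   !~ /\A\p{L}+\z/;

# Allowing marks after each letter accepts both spellings.
my $letters_with_marks = qr/\A(?:\p{L}\p{M}*)+\z/;
print "both match\n"
    if $precomposed =~ $letters_with_marks
    && $combining   =~ $letters_with_marks;

# Normalizing to NFC composes where a composed code point exists ...
print "NFC composes\n" if NFC($combining) eq $precomposed;

# ... and leaves the sequence alone where none exists.
print "still two code points\n" if length(NFC($no_single)) == 2;

# \X matches one extended grapheme cluster, whatever the spelling.
my @clusters = $no_single =~ /\X/g;
print scalar(@clusters), " cluster\n";
```

      So normalizing the input first helps, but it does not make the problem go away. A pattern that is meant to accept "letters" has to decide what to do with \p{M} either way.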