Re^2: Unicode regular expressions

I have had to deprioritize this particular project for now, but the answers so far contain a lot of useful information and experience which I will need to study. The main point is that people are picking up on my choice of requirements. If they are vague, that might be a good thing, seeing as each interpretation of my requirements might elicit more useful information. However, I can clarify. My test was really that the regular expression should accept "księgowość" but reject "Ł$%%^&". I was surprised at how hard this was. More generally, I was hoping the regular expression would capture "reasonable search terms". As such, I would regard a Chinese sentence as valid but an emoticon character as invalid.
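To make that test concrete, here is a minimal Perl sketch of it. The helper name `is_search_term` is hypothetical, and reading "reasonable search term" as "runs of letters and combining marks, optionally separated by single spaces" is an assumption rather than a settled requirement. It would, for instance, reject a Chinese sentence that ends in punctuation.

```perl
use strict;
use warnings;
use utf8;    # the string literals in this file are UTF-8

# Hypothetical helper: accept one or more "words" made only of letters
# (\p{L}) and combining marks (\p{M}), separated by single spaces.
# This rule is an assumption about what "reasonable" means.
sub is_search_term {
    my ($text) = @_;
    return $text =~ /\A[\p{L}\p{M}]+(?: [\p{L}\p{M}]+)*\z/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';

# Single quotes matter here: in double quotes Perl would interpolate $%.
for my $candidate ('księgowość', 'Ł$%%^&', '我喜欢编程', '☺') {
    printf "%s => %s\n", $candidate,
        is_search_term($candidate) ? 'accept' : 'reject';
}
```

Anchoring with `\A` and `\z` rather than `^` and `$` avoids accepting a trailing newline. Whether digits, apostrophes or hyphens should also be allowed is left open.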

Re^3: Unicode regular expressions by JavaFan (Canon) on Dec 09, 2009 at 22:05 UTC
Oh, you want to recognize words. You know, you don't have to leave the ASCII realm to realize that this is trickier than just matching letters and not matching punctuation symbols. Not matching punctuation symbols means rejecting "don't" as a word.

As for matching Unicode letters, we have:

    "ญᴥ一ךى" =~ /^\p{L}+$/

which is a sequence of (Unicode) letters, but from 5 different scripts. Do you want to match that?

And then I haven't touched the can of worms called 'combining sequences'. Many (all?) of the accented Unicode characters can also be formed by taking the base character and adding the various decorations to it. Not to mention that most combinations of a base character and decorations don't have a Unicode code point of their own, and have to be made with combining sequences.
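A rough sketch of these pitfalls in Perl follows. The three helper names are hypothetical, and restricting the single-script check to Latin is only an assumption made for illustration. `Unicode::Normalize` is a core module; the strings are written with `\x{...}` escapes so the file itself stays ASCII.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Letters only: what /^\p{L}+$/ checks, anchored with \A and \z.
sub is_letters            { my ($s) = @_; return $s =~ /\A\p{L}+\z/ ? 1 : 0 }

# Letters, each optionally followed by combining marks.
sub is_letters_with_marks { my ($s) = @_; return $s =~ /\A(?:\p{L}\p{M}*)+\z/ ? 1 : 0 }

# Same, but every base letter must belong to the Latin script.
sub is_latin_word         { my ($s) = @_; return $s =~ /\A(?:\p{Latin}\p{M}*)+\z/ ? 1 : 0 }

my $mixed          = "\x{E0D}\x{1D25}\x{4E00}\x{5DA}\x{649}"; # the 5-script string
my $decomposed     = "e\x{301}";  # e + COMBINING ACUTE ACCENT
my $no_precomposed = "x\x{301}";  # x + acute: no single code point exists

printf "mixed scripts, letters only:   %d\n", is_letters($mixed);
printf "mixed scripts, Latin only:     %d\n", is_latin_word($mixed);
printf "don't, letters only:           %d\n", is_letters("don't");
printf "decomposed e, letters only:    %d\n", is_letters($decomposed);
printf "decomposed e after NFC:        %d\n", is_letters(NFC($decomposed));
printf "x + acute after NFC:           %d\n", is_letters(NFC($no_precomposed));
printf "x + acute, marks allowed:      %d\n", is_letters_with_marks($no_precomposed);
```

Normalizing to NFC helps only where a precomposed character exists, so allowing `\p{M}` after each letter (or matching grapheme clusters with `\X`) is still needed for the general case.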
