There are few best practices, which might or might not answer your question:
- Don't match character ranges. You will forget some. For example is there are a good reason to match [a-zA-Z], but not all those other Latin characters out there? Unicode contains more than 100k characters. Enumerating a subset of them is bound to fail, unless you have very narrow ideas about your subset.
- Don't match Unicode blocks. They are just organizational units, nothing that the user or programmer should ever care about
- If you want to check for Letter, Digits etc. use the appropriate Unicode property (a list can be found in perlunicode), like \p{LowercaseLetter} or short \p{Ll} (though the longer form is probably better readable)
- If you want to check for a script, use constructs like \p{Hiragana}.
- Remeber that there might be diacritic markings that belong conceptually to a different script, so instead of \p{YourScript}+ you might want to check for \p{YourScript}(?:\p{Mark}|\p{YourScript})*.
- When counting characters, use \X rather than . in regexes.
(Disclaimer: I assume you deal with human language. For file formats or other artificial stuff it may very well be appropriate to do things that I recommended against above).