
There are a few best practices, which may or may not answer your question:

Don't match character ranges. You will forget some. For example, is there a good reason to match [a-zA-Z] but not all the other Latin characters out there? Unicode contains more than 100k characters. Enumerating a subset of them is bound to fail unless you have a very narrow idea of what your subset is.
Don't match Unicode blocks. They are just organizational units, nothing that the user or programmer should ever care about.
If you want to check for letters, digits, etc., use the appropriate Unicode property (a list can be found in perlunicode), such as \p{LowercaseLetter} or its short form \p{Ll} (though the longer form is probably more readable).
If you want to check for a script, use constructs like \p{Hiragana}.
Remember that there might be diacritic marks that conceptually belong to a different script, so instead of \p{YourScript}+ you might want to check for \p{YourScript}(?:\p{Mark}|\p{YourScript})*.
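When counting characters, use \X rather than . in regexes: \X matches a whole grapheme cluster (a base character plus any combining marks), whereas . matches a single code point.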
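
To make the points above concrete, here is a minimal sketch, not a definitive recipe. The sample strings and variable names are my own, chosen for illustration. I write the script check as \p{Script=Latin} rather than the bare \p{Latin}, on the assumption that the explicit form behaves the same across Perl versions (newer Perls are documented to treat the bare form as Script_Extensions).

```perl
use v5.14;      # strict plus Unicode string semantics
use warnings;

# Non-ASCII characters are written as escapes so this file stays plain ASCII.
my $cafe = "caf\x{E9}";     # "cafe" with a precomposed e-acute
my $zoe  = "Zoe\x{308}";    # "Zoe" with e followed by COMBINING DIAERESIS

# 1. A character range skips the accented letter; the property does not.
my $by_range    = () = $cafe =~ /[a-zA-Z]/g;
my $by_property = () = $cafe =~ /\p{Lowercase_Letter}/g;

# 2. The combining mark is not assigned to the Latin script, so a plain
#    script check rejects the string; allowing \p{Mark} accepts it.
my $plain_script = $zoe =~ /\A\p{Script=Latin}+\z/ ? 1 : 0;
my $with_marks   =
    $zoe =~ /\A\p{Script=Latin}(?:\p{Mark}|\p{Script=Latin})*\z/ ? 1 : 0;

# 3. "." counts code points, \X counts grapheme clusters.
my $code_points = () = $zoe =~ /./gs;
my $graphemes   = () = $zoe =~ /\X/g;

print "range=$by_range property=$by_property\n";
print "plain_script=$plain_script with_marks=$with_marks\n";
print "code_points=$code_points graphemes=$graphemes\n";
```

With these inputs the range finds 3 letters where the property finds 4, the plain script check fails while the mark-aware one succeeds, and the decomposed name counts as 4 code points but 3 grapheme clusters.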

(Disclaimer: I assume you are dealing with human language. For file formats or other artificial data it may very well be appropriate to do the things I recommended against above.)

In reply to Re: Modern best practices for multilingual regexp alphabetical character matching? by moritz
in thread Modern best practices for multilingual regexp alphabetical character matching? by dmorgo


