note
moritz
There are a few best practices, which might or might not answer your question:
<ul>
<li>Don't match character ranges. You <b>will</b> forget some. For example, is there a good reason to match <c>[a-zA-Z]</c>, but not all those other Latin characters out there? Unicode contains more than 100k characters. Enumerating a subset of them is bound to fail, unless you have <i>very</i> narrow ideas about your subset.</li>
<li>Don't match Unicode blocks. They are just organizational units, nothing that the user or programmer should ever care about.</li>
<li>If you want to check for letters, digits etc., use the appropriate Unicode property (a list can be found in [doc://perlunicode]), like <c>\p{LowercaseLetter}</c> or its short form <c>\p{Ll}</c> (though the longer form is probably more readable).</li>
<li>If you want to check for a script, use constructs like <c>\p{Hiragana}</c>.</li>
<li>Remember that there might be diacritic markings that conceptually belong to a different script, so instead of <c>\p{YourScript}+</c> you might want to check for <c>\p{YourScript}(?:\p{Mark}|\p{YourScript})*</c>.</li>
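<li>When counting characters, use <c>\X</c> rather than <c>.</c> in regexes: <c>.</c> matches a single code point, while <c>\X</c> matches a base character together with its combining marks.</li>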
</ul>
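<p>To make the points above concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration; only the regex constructs come from the list above.</p>
<code>
use strict;
use warnings;

# \N{U+...} escapes keep the source ASCII-only and yield Unicode strings.
my $word = "\N{U+00FC}ber";    # u with diaeresis, followed by "ber"

# A character range silently skips the non-ASCII letter; the property does not.
my $by_range    = () = $word =~ /[a-z]/g;
my $by_property = () = $word =~ /\p{LowercaseLetter}/g;

# Script check that tolerates combining marks:
# HIRAGANA LETTER KA followed by the combining voiced sound mark.
my $kana = "\N{U+304B}\N{U+3099}";
my $is_hiragana =
    $kana =~ /\A\p{Hiragana}(?:\p{Mark}|\p{Hiragana})*\z/ ? 1 : 0;

# Counting user-perceived characters: "e" plus COMBINING ACUTE ACCENT.
my $accented   = "e\N{U+0301}";
my $codepoints = () = $accented =~ /./g;
my $graphemes  = () = $accented =~ /\X/g;

print "range=$by_range property=$by_property\n";       # range=3 property=4
print "hiragana=$is_hiragana\n";                       # hiragana=1
print "codepoints=$codepoints graphemes=$graphemes\n"; # codepoints=2 graphemes=1
</code>
<p>The range finds only three of the four lowercase letters, and <c>.</c> counts the accented "e" as two characters where <c>\X</c> counts one.</p>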
<p>(Disclaimer: I assume you are dealing with human language. For file formats or other artificial stuff it may very well be appropriate to do the things that I recommended against above.)</p>
735804
735804