Well, actually, I recently wrote a not-so-small application (20k lines) that allows Unicode everywhere and handles it correctly.
But it does not need 90% of the things you listed.
Probably because my application does not try to analyze text data (it only stores, converts, compares and re-encodes it), it needs neither sorting nor fc-aware comparison.
Your case is probably something that analyzes text (right now I can only imagine something related to natural language processing, a word processor, or maybe a dictionary).
So I think different applications need different levels of Unicode support.
Below are some cases where the policies you listed can be wrong:
lc($a) cmp/eq/ne/... lc($b) should be using fc. Same story with uc.
Something like a-z should often be \p{Ll} or \p{lower}
If you write, say, code that has to parse HTTP headers (no, that's not reinventing the wheel like yet another HTTP library; it can be a proxy server or a REST library),
then "cmp" and "a-z" would be the correct choice, and fc() and \p{lower} can introduce bugs (say, with "ß" vs "ss").
Other examples are unit tests, where you usually deal with predefined data sets, internal program metadata that is always plain ASCII,
or comparison of MD5/SHA hex values, etc.
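A minimal sketch of the difference, assuming Perl 5.16 or later (needed for fc). The helper name ascii_ci_eq and the sample strings are made up for illustration; this is not taken from any HTTP library.

```perl
use v5.16;        # enables strict plus the 'fc' and 'unicode_strings' features
use warnings;

# Hypothetical helper: case-insensitive equality that folds only ASCII A-Z.
sub ascii_ci_eq {
    my ($x, $y) = @_;
    tr/A-Z/a-z/ for $x, $y;    # every non-ASCII character is left untouched
    return $x eq $y;
}

my $sharp_s = "stra\x{DF}e";   # contains U+00DF, written as an escape to avoid source-encoding issues
my $plain   = "strasse";

# ASCII-only folding matches header names regardless of case...
my $ascii_header_match = ascii_ci_eq("Content-Type", "content-type");
# ...and keeps two different byte sequences different.
my $ascii_distinct = !ascii_ci_eq($sharp_s, $plain);
# Full Unicode case folding maps U+00DF to "ss", so fc() treats them as equal.
my $fc_conflates = fc($sharp_s) eq fc($plain);

# The same split shows up in character classes: U+00DF is a lowercase letter
# for \p{Ll}, but it is not in a-z.
my $az_rejects    = "\x{DF}" !~ /\A[a-z]+\z/;
my $lower_accepts = "\x{DF}" =~ /\A\p{Ll}+\z/;

say "ASCII match on header names: ", $ascii_header_match ? "yes" : "no";
say "ASCII compare keeps strings distinct: ", $ascii_distinct ? "yes" : "no";
say "fc() conflates them: ", $fc_conflates ? "yes" : "no";
```

For protocol tokens, digests and similar ASCII-only data, the narrower comparison is the one that matches the data's actual definition.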
Opening a text file without stating its encoding somewhere or other is a recipe for failure.
Unless it's a binary file.
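A small sketch of the two cases, assuming a throwaway temporary file created just for the demonstration. The same three bytes are read once as raw bytes and once as UTF-8 text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three known bytes: the UTF-8 encoding of U+00DF followed by LF.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xC3\x9F\n";
close $tmp_fh or die "close failed: $!";

# Binary file: no encoding layer, the program sees bytes.
open my $bin, '<:raw', $path or die "cannot open $path: $!";
my $bytes = do { local $/; <$bin> };
close $bin;

# Text file: the encoding is stated explicitly, the program sees characters.
open my $txt, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $chars = do { local $/; <$txt> };
close $txt;

printf "raw: %d bytes, decoded: %d characters\n", length $bytes, length $chars;
```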
@lines = do { local $/; split /\R/, <INPUT> };
Hm. I don't think it's correct to treat something like U+2028 as a line separator for files.
You need code like this if you read from a text file.
A text file is something separated by LF or CRLF; other combinations are not portable.
If you are writing a word processor that should handle U+2028, you should not mix this with system file I/O; instead, introduce your own logic for
splitting data into "lines" and paragraphs.
I don't see where it can be correct to mix "lines" from your word processor logic with the lines of a text file on disk (or a socket).
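To illustrate the difference, here is a sketch with a made-up sample string. It splits the same data once on LF/CRLF only and once on \R, which also matches U+2028 and other vertical whitespace.

```perl
use strict;
use warnings;

# Sample data: one U+2028 inside the first line, then CRLF and LF terminators.
my $text = "first\x{2028}still first\r\nsecond\nthird";

# File-level lines: only LF and CRLF count as separators.
my @file_lines = split /\r?\n/, $text;

# \R also matches U+2028, so the first file line is cut in two.
my @any_lines = split /\R/, $text;

# Application-level logic can then treat U+2028 on its own terms,
# separately from how the file was read.
my @soft_breaks = split /\x{2028}/, $file_lines[0];

printf "LF/CRLF split: %d lines, \\R split: %d lines\n",
    scalar @file_lines, scalar @any_lines;
```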