I don't really care how many bytes are needed to STORE a codepoint, but if a codepoint doesn't equal a character (sorry, grapheme) then there is really no point in using that concept. In UTF-8 (and I believe in most other ways to encode Unicode) some codepoints take more bytes than others. The lead byte of a multi-byte sequence is reserved to mean something like "waitasecond ... we are not done yet, read the next N bytes". Which means I can't assume a byte is a character; I have to understand those reserved codes and take enough bytes to have the whole "character". With codepoint != character (OK, grapheme, that sounds cooler), this one level of "you have to understand the data to be able to even just say how long it is" is not enough, 'cause there's yet another level on top of that.
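To make the first level concrete, here is a minimal Python sketch of the "read the next N bytes" rule. The helper names `utf8_seq_len` and `split_codepoints` are made up for illustration, and the sketch assumes well-formed UTF-8 input; it does not validate continuation bytes.

```python
def utf8_seq_len(lead):
    """Return the length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: plain ASCII, one byte
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx: one continuation byte follows
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx: two continuation bytes follow
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx: three continuation bytes follow
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def split_codepoints(data):
    """Cut a UTF-8 byte string into one chunk of bytes per codepoint."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# 'a' (U+0061), 'é' (U+00E9) and '€' (U+20AC) need 1, 2 and 3 bytes.
sample = "a\u00e9\u20ac".encode("utf-8")
lengths = [len(chunk) for chunk in split_codepoints(sample)]
print(lengths)  # [1, 2, 3]
```

This only gets you from bytes to codepoints. It says nothing yet about where a grapheme starts or ends.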

Sorry, that sounds rather ... erm ... interesting.

Separating a grapheme into a base character and a decoration makes some sense in DTP. In case your font doesn't have that character you may construct it out of something it does have. (TeX used to do that for Czech accented characters at first.) For anything else it's overcomplication.
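A small Python sketch of what "base character plus decoration" looks like in practice, and why it matters for reversing a string. The `clusters` helper is a hypothetical simplification: it only glues combining marks onto the preceding character and is not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single codepoint
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same grapheme on screen, different number of codepoints.
print(len(composed), len(decomposed))  # 1 2

# NFC normalization recombines the pair into the single codepoint.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch    # non-zero combining class: attach to previous
        else:
            out.append(ch)
    return out


word = "ne\u0301"                             # 'n', then 'e' + accent
naive = word[::-1]                            # accent ends up in front of 'e'
careful = "".join(reversed(clusters(word)))   # accent stays on its 'e'
```

Reversing codepoint by codepoint gives `"\u0301en"`, with the accent detached from its letter, while reversing cluster by cluster gives `"e\u0301n"`.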

I would rather have the list of Unicode codepoints grow (which I believe it does anyway) and KNOW where a character starts and ends in the stream of bytes.

Jenda
P.S.: If you overcomplicate, you lose correctness as well, because no one will bother to implement all that stuff correctly.
Support Denmark!
Defend the free world!


In reply to Re^4: [RFC] How to reverse a (text) string by Jenda
in thread [RFC] How to reverse a (text) string by moritz
