
Sure, I think it's intuitive why lc('Boaty McBoat') is conceptually a "lossy" transformation (in terms of being able to restore the original string).

But NFKC("\N{VULGAR FRACTION THREE EIGHTHS}") is conceptually "lossless": there is only one Unicode character the resultant string "3\N{FRACTION SLASH}8" could be "composed" into.

As I wrote, I get now why NFKC is conceptually lossy in general. But—unlike with lc—some specific decompositions are exceptions.
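
For anyone following along, here is a minimal sketch of the two cases being compared. It assumes only the core Unicode::Normalize module; the helper name survives_nfkc is made up for illustration. Note that NFKC itself never puts the precomposed fraction back, whatever one thinks of the mapping being one-to-one.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: true if NFKC leaves the string unchanged.
sub survives_nfkc {
    my ($string) = @_;
    return NFKC($string) eq $string;
}

# lc is lossy: distinct inputs collapse to the same output.
print "lc collides\n" if lc('Boaty McBoat') eq lc('BOATY MCBOAT');

# U+215C VULGAR FRACTION THREE EIGHTHS expands to 3, FRACTION SLASH, 8.
print "expanded\n" if NFKC("\x{215C}") eq "3\x{2044}8";

# The expanded form is already stable under NFKC, so nothing recomposes it.
print "no recomposition\n" if survives_nfkc("3\x{2044}8");
```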

Re^4: Unicode vulgar fraction composition
by soonix (Chancellor) on Oct 05, 2020 at 08:43 UTC
    consider:
    • 123\N{FRACTION SLASH}8
    • 12\N{VULGAR FRACTION THREE EIGHTHS}
    I would read the former as "one hundred twenty-three eighths", but the latter as "twelve (plus) three eighths", so it's not completely a one-to-one relationship.

      Yes, my understanding is that's how Unicode would have you interpret each of those.

      So the problem then becomes that running NFKC on the latter produces the former: a nonequivalent string, therefore erroneous output. The correctly decomposed form of "12\N{VULGAR FRACTION THREE EIGHTHS}" would be, I presume, "12\N{ZERO WIDTH NON-JOINER}3\N{FRACTION SLASH}8". (Whether this is a bug or merely a "gotcha" in NFKC I suppose is a matter of interpretation.)

      But point taken that context matters when composing vulgar fractions.
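
The collision described above can be checked with a short sketch. It assumes the core Unicode::Normalize module; the helper name same_after_nfkc is hypothetical, and the ZERO WIDTH NON-JOINER variant is only the presumed form from the post, not something NFKC produces.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: true if two strings become identical under NFKC.
sub same_after_nfkc {
    my ($left, $right) = @_;
    return NFKC($left) eq NFKC($right);
}

my $improper = "123\x{2044}8";    # 123, FRACTION SLASH, 8
my $mixed    = "12\x{215C}";      # 12 followed by VULGAR FRACTION THREE EIGHTHS

# NFKC maps both readings onto the same code point sequence.
print "collision\n" if same_after_nfkc($improper, $mixed);

# With a ZERO WIDTH NON-JOINER (the presumed separator) they stay distinct,
# because NFKC does not remove U+200C.
my $separated = "12\x{200C}3\x{2044}8";
print "distinct\n" unless same_after_nfkc($improper, $separated);
```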

Re^4: Unicode vulgar fraction composition
by ikegami (Patriarch) on Oct 06, 2020 at 04:14 UTC

    There's no way to know that 3/8 means three-eighths. For example, it could mean March 8th. As such there are two possible compositions for 3/8: VULGAR FRACTION THREE EIGHTHS and 3/8.

      Absolutely true if (as you wrote) a U+002F SOLIDUS appears between the 3 and the 8. This is why I've been limiting my scope to the case where a U+2044 FRACTION SLASH appears between them, i.e., the specific sequence that NFKC or NFKD decomposes a Unicode vulgar fraction into.
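
To make the SOLIDUS versus FRACTION SLASH distinction visible regardless of font, a small sketch that lists code points can help. It assumes the core Unicode::Normalize module; codepoints is a hypothetical helper name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: list a string's code points in U+XXXX notation.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

# Decomposing the vulgar fraction yields U+2044, not U+002F.
print codepoints(NFKD("\x{215C}")), "\n";    # U+0033 U+2044 U+0038

# A typed "3/8" uses SOLIDUS and is untouched by NFKD.
print codepoints(NFKD("3/8")), "\n";         # U+0033 U+002F U+0038
```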

        My mistake.

        Then yeah, one could possibly argue that this should have been a canonical decomposition rather than a compatibility decomposition. But they'd be wrong.

        A program is free to switch between the NFC and the NFD of a string at any time. As such, they should be visually and semantically indistinguishable. In other words, the two forms are simply two different ways of encoding graphemes internally.
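
As a rough illustration of that interchangeability, the sketch below compares strings by their NFD form. It assumes the core Unicode::Normalize module; canonically_equivalent is a hypothetical helper name, not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: two strings are canonically equivalent if their
# NFD forms are identical.
sub canonically_equivalent {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right);
}

my $precomposed = "\x{C5}";      # LATIN CAPITAL LETTER A WITH RING ABOVE
my $decomposed  = "A\x{30A}";    # A followed by COMBINING RING ABOVE

# Same grapheme, two encodings; NFC and NFD convert freely between them.
print "equivalent\n" if canonically_equivalent($precomposed, $decomposed);
print "round trip\n" if NFC(NFD($precomposed)) eq NFC($precomposed);

# The fraction is different: NFD leaves U+215E alone, so it is not
# canonically equivalent to 7, FRACTION SLASH, 8.
print "not equivalent\n"
    unless canonically_equivalent("\x{215E}", "7\x{2044}8");
```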

        Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter a and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.

        "7/8" isn't a grapheme[1], much less the same one as "⅞". As such, the two strings could have different appearances or meanings, and it's easy to come up with an example where someone might intentionally use "7/8" over "⅞". Imagine a document containing "... between 7/8 and 15/16 of the ...". The author might purposefully not use "⅞" for stylistic consistency. It would not be proper for a program to automatically convert "7/8" to "⅞" wherever it occurs.

        The short version is that no one can guess what transformations you want to perform, so it's up to you to determine the rules you want to follow, which is to say write a program that does what you want. Do you want to change "7/8" into "⅞" unconditionally? Conditionally? What about LATIN CAPITAL LETTER A WITH RING ABOVE (Å)? Is there a time it should become ANGSTROM SIGN (Å)? Etc. These are decisions for you to take.
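
One possible set of rules, purely as a sketch: compose only digit runs joined by a FRACTION SLASH (never a SOLIDUS), and only when the whole run matches a known vulgar fraction. The sub name compose_fractions, the choice of code point ranges, and the rule itself are assumptions for illustration; your rules may well differ.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Reverse map from the NFKD form (e.g. "7\x{2044}8") to the precomposed
# character. U+215F FRACTION NUMERATOR ONE is skipped on purpose, since it
# has no denominator.
my %compose;
for my $code_point (0x00BC .. 0x00BE, 0x2150 .. 0x215E, 0x2189) {
    my $char = chr $code_point;
    $compose{ NFKD($char) } = $char;
}

# Assumed rule: replace a digits/FRACTION SLASH/digits run only when it is
# not adjacent to further ASCII digits and is a known vulgar fraction.
sub compose_fractions {
    my ($text) = @_;
    $text =~ s{(?<![0-9])([0-9]+\x{2044}[0-9]+)(?![0-9])}
              { exists $compose{$1} ? $compose{$1} : $1 }ge;
    return $text;
}
```

Under this rule "3\x{2044}8" becomes "\x{215C}", while "123\x{2044}8" and anything written with an ordinary "/" are left alone.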


        1. Note I used a normal slash instead of a FRACTION SLASH throughout this post to avoid confusion, because my browser rendered fractions with a FRACTION SLASH much like "⅞", and yours might too. But it is under no obligation to do so, and other renderers won't do this.