in reply to Re: OSCON Perl Unicode Slides
in thread OSCON Perl Unicode Slides

The only letters in ASCII are 'a' .. 'z' and 'A' .. 'Z'. The classic example is Jalapeno (which is written stupidly) versus Jalapeño, which is written correctly but doesn't fit into ASCII. This opens a big debate as to whether Jalapeño is an English word; the short answer is that any word commonly used in English is an English word (at least that's what many people assert).

Who am I to say that Björn Gunnlaugsson should change his name to Bjorn when he purchases a wallet at JC Penney? Yet that's what happens when the guy who programs POS terminals doesn't consider names that contain non-ASCII characters.
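To make the POS-terminal failure concrete, here is a minimal sketch in Python (purely for illustration; it isn't from the slides, and the function names are hypothetical). It compares a name check that only knows the ASCII letters with one that accepts any Unicode letter. One assumption to flag: the input is normalized to NFC first, so that an 'o' followed by a combining diaeresis is treated the same as a precomposed 'ö'.

```python
import re
import unicodedata

# Hypothetical validators for a customer-name field (illustration only).
ASCII_NAME = re.compile(r"[A-Za-z]+")


def is_ascii_name(name: str) -> bool:
    """Accept only the 52 unaccented ASCII letters."""
    return ASCII_NAME.fullmatch(name) is not None


def is_unicode_name(name: str) -> bool:
    """Accept any non-empty run of Unicode letters.

    Assumption: normalizing to NFC folds base letter + combining mark
    into one precomposed letter where such a character exists.
    """
    return unicodedata.normalize("NFC", name).isalpha()


for name in ("Bjorn", "Björn", "Jalapeño"):
    print(name, is_ascii_name(name), is_unicode_name(name))
```

The ASCII-only check accepts just "Bjorn", while the Unicode-aware check accepts all three. A real name field would also have to allow spaces, hyphens, and apostrophes, which this sketch deliberately leaves out.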

How dumb does it look to type '/' when we mean to use the obelus (÷) symbol? That isn't included in ASCII. Neither is the cent symbol (¢). Sure, we have our dumb workarounds like '/' and $0.01, and you might argue that ÷ and ¢ are not strictly English, but the proper symbols are what make a document appear well edited versus typed by some guy at 2am on the Internewebs.

Another example is found in the Wikipedia Æ entry:

In English, usage of the ligature varies in different places. In modern typography, and where technological limitations make its use difficult (such as in use of typewriters), æ is often eschewed in favor of the digraph ae. This is often considered incorrect especially when rendering foreign words where æ is considered a letter (e.g. Æsir, Ærø) or brand names which make use of the ligature (e.g. Æon Flux, Encyclopædia Britannica). In the United States, the problem of the ligature is sidestepped in many cases by use of a simplified spelling with "e"; compare the common usage, medieval, with the traditional mediæval. However, given the long history of such spellings, they are sometimes used to invoke archaism or in literal quotations of historic sources; for instance, words such as dæmon are often treated in this way. Often, it will be replaced with a simple "ae" as in archaeology.

Update: I should have expected this to escalate, so please let me try to douse the fire. This post was an attempt (however inadequate it may have been) to explain the arguments that I have seen tchrist present in support of the assertion that was the basis for Tanktalus's question. I will admit that my examples may not rise to the level of the ones he might have presented. I didn't intend to start an argument over the merits of his assertion either; I was just trying to give a few examples of what he's talking about. Though I tend to agree with the principle that ASCII is only a subset of the characters needed to gracefully express a language, it's an endless and pointless debate. Pointless because Unicode is here to address the issues, and endless because Unicode isn't going away.


Dave

Re^3: OSCON Perl Unicode Slides
by Tanktalus (Canon) on Jul 25, 2011 at 19:29 UTC

    Thanks. I'm not sure I buy all of it, but it's still something to chew on.

    Specifically, "Björn" is not English (but may need to be legitimately processed by software that is otherwise English-only). I can't remember the last time I handwrote ÷ instead of putting the numerator over the denominator (which a slash approximates, much like the slash in %). It must have been grade school, and even then, likely pretty early in grade school. I actually find ÷ to be weird. :-) And, of course, 1¢ is meaningless today :-) I see very few items that cost less than one dollar anymore, so actually seeing the cent symbol seems like an anachronism.

    As for the Æ bit, well, "rendering foreign words" means "not English". The legitimate part is dealing with foreign brand names - again, it's not English, but may need to be legitimately processed by software that is otherwise English-only. The rest of your quoted text shows that English has largely moved away from using the ligature, and moved on to usually using ae instead. In my experience, even words that natively would have had accents and such on them usually lose them when misappropriated by English, such as your Jalapeno, or Jim's resume (a single spelling of a word can have multiple pronunciations, without even needing to have different meanings! - think 'po TAY toe'/'po TAH toe')

    Again, thanks. I was too much in a box here, which was uncomfortable because I usually like thinking outside boxes. I think that if tchrist's mini-rant weren't so over the top, and instead focused on why that's the case, I may have been a bit less confused. In fact, that'd be probably the main critique here for me: spend less time describing how bad something is and instead focus on why something is bad. Perhaps those who attend the talk will hear the reason why things are bad, but those of us who only get to see the slides may miss out :-)

      "Björn" is not English

      Names are universal. Anyway, "coöperate" is English (if a bit archaic). And anyway, we've already debated this at length. I'm thinking we shouldn't have to again.

      The bottom line for you, I think, is that it doesn't really matter which specific language any one document is written in. You will have to handle multi-lingual data, and that means Unicode.

      I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
Re^3: OSCON Perl Unicode Slides
by BrowserUk (Patriarch) on Jul 25, 2011 at 19:52 UTC
    How dumb does it look to type '/' when we mean to use the obelus (÷) symbol?

    What a fatuous argument. Have you *ever* seen a mathematician (outside of an infants (grade) school) write 7 ÷ 3 instead of 7/3?

    Of course you haven't.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      In Germany, division is written as

      22 : 7 = 3,1428...

      Of course, when not restricted to a single line, the fractional notation is more useful:

      22
      -- = 3,1428...
       7
        In Germany, division is written as 22 : 7

        Are you sure?

        In England, ratios are often also denoted using ':'. E.g. 4:3, 16:9 (screen ratios); 3:1, 2:1 on (betting odds etc.). But these are all whole-number ratios. You would never see 1 : 3.141592653. That would always be 1/3.141592653.

        I don't know much about German mathematics (apart from the fact that they've historically led the world at it), but in the few months I worked there, I never saw division written or typed as x:y, always x/y, unless it was a whole-number ratio. Is my memory flawed? Or were my co-workers and correspondents simply accommodating the quaint Englishman?

