Final versions of each of the three talks I’m giving at OSCON on Unicode are now available, each in three formats:
  1. The original POD source for grepping and hacking, and general delectation.
  2. HTML for spiffy slideshow viewing (but get the two fonts it recommends to do so, Alfios and especially Symbola). Tested under Safari on Darwin, purported to also work on recent versions of Firefox.
  3. PDF for font‐hassle–free printing.

Also, if you don’t yet have Unicode::Tussle, you might want to look at that, too.

Re: OSCON Perl Unicode Slides
by Tanktalus (Canon) on Jul 25, 2011 at 18:20 UTC

    Very informative, thanks. I do have one question, though. On page 49, you say "Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong." My apologies for being shortsighted/underinformed, but could you expound on this a bit more? In what ways?

    I ask because I seem to be writing the translation-handling code at $work, and will likely be the de facto developer dealing with translation issues, both from the translators and from developers, so this topic intrigues me greatly, and I'm trying to understand as much of it as possible in as little time as possible so that I can deal with issues as they come up.

    Thanks,

      The ASCII alphabet (excluding non-alpha) has the alpha characters of 'a' .. 'z', 'A' .. 'Z'. There is the classic example of Jalapeno (which is written stupidly) versus Jalapeño, which is written correctly but doesn't fit into ASCII. This opens a big debate as to whether Jalapeño is an English word, and the short answer is that any word commonly used in English constitutes an English word (at least that's what many people assert).

      Who am I to say that Björn Gunnlaugsson should change his name to Bjorn when he purchases a wallet at JC Penney? Yet that's what happens when the guy who programs POS terminals doesn't consider names that contain non-ASCII characters.

      How dumb does it look to type '/' when we mean to use the obelus (÷) symbol? That isn't included in ASCII. Neither is the cent symbol (¢). Sure, we have our dumb workarounds like '/' and $0.01, and you might argue they're not strictly English, but they are what makes a document appear well edited versus typed by some guy at 2am on the Internewebs.

      Another example is found in the Wikipedia Æ entry:

      In English, usage of the ligature varies in different places. In modern typography, and where technological limitations make its use difficult (such as in use of typewriters), æ is often eschewed in favor of the digraph ae. This is often considered incorrect especially when rendering foreign words where æ is considered a letter (e.g. Æsir, Ærø) or brand names which make use of the ligature (e.g. Æon Flux, Encyclopædia Britannica). In the United States, the problem of the ligature is sidestepped in many cases by use of a simplified spelling with "e"; compare the common usage, medieval, with the traditional mediæval. However, given the long history of such spellings, they are sometimes used to invoke archaism or in literal quotations of historic sources; for instance, words such as dæmon are often treated in this way. Often, it will be replaced with a simple "ae" as in archaeology.

      Update: I should have expected this to escalate, so please let me try to douse the fire. This post was an attempt (however inadequate it may have been) to explain the arguments that I have seen tchrist present to support his assertion that was the basis for Tanktalus's question. Whether my examples rise to the level of those he might have presented is a point on which I will admit I may have fallen short. I didn't intend to start an argument about the merits of his assertion, either; I was just trying to give a few examples of what he's talking about. Though I tend to agree with the principle that ASCII text is only a subset of the characters needed to gracefully express a language, it's an endless and pointless debate. Pointless because Unicode is here to address the issues, and endless because Unicode isn't going away.


      Dave

        Thanks. I'm not sure I buy all of it, but it's still something to chew on.

        Specifically, "Björn" is not English (but may need to be legitimately processed by software that is otherwise English-only). I can't remember the last time I handwrote ÷ instead of putting numerator over denominator (which a slash approximates, much as the slash in % does). It must have been grade school, and even then, likely pretty early in grade school. I actually find ÷ to be weird. :-) And, of course, 1¢ is meaningless today :-) I see very few items that cost less than one dollar anymore, so actually seeing the cent symbol seems like an anachronism.

        As for the Æ bit, well, "rendering foreign words" means "not English". The legitimate part is dealing with foreign brand names - again, it's not English, but may need to be legitimately processed by software that is otherwise English-only. The rest of your quoted text shows that English has largely moved away from using the ligature, and moved on to usually using ae instead. In my experience, even words that natively would have had accents and such on them usually lose them when misappropriated by English, such as your Jalapeno, or Jim's resume (a single spelling of a word can have multiple pronunciations, without even needing to have different meanings! - think 'po TAY toe'/'po TAH toe')

        Again, thanks. I was too much in a box here, which was uncomfortable because I usually like thinking outside boxes. I think that if tchrist's mini-rant weren't so over the top, and instead focused on why that's the case, I may have been a bit less confused. In fact, that'd be probably the main critique here for me: spend less time describing how bad something is and instead focus on why something is bad. Perhaps those who attend the talk will hear the reason why things are bad, but those of us who only get to see the slides may miss out :-)

        How dumb does it look to type '/' when we mean to use the obelus (÷) symbol?

        What a fatuous argument. Have you *ever* seen a mathematician (outside of an infants (grade) school) write 7 ÷ 3 instead of 7/3?

        Of course you haven't.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      I seem to be writing the translation-handling code at $work…

      This job of yours—how did you get it? By submitting a résumé, right? If you had submitted a resume, you probably wouldn't have gotten the job.

      (There are several different characters in this short post that are not among the 94 printable characters in the ASCII character set. Can you spot them?)

      Jim
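One way to answer the "can you spot them?" challenge mechanically is to let Perl list every character outside the 7-bit ASCII range. This is only a sketch: `non_ascii_chars` is a hypothetical helper name, and the sample string is a shortened stand-in for the post above, with the non-ASCII characters written as `\x{...}` escapes so the source file itself stays pure ASCII.

```perl
use strict;
use warnings;
use charnames ();   # load charnames::viacode without importing \N{} names

# Hypothetical helper: return the distinct non-ASCII characters of a string,
# in order of first appearance.
sub non_ascii_chars {
    my ($text) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $text =~ /([^\x00-\x7F])/g;
}

# Sample text (an assumption, shortened from the post): U+2014 and U+00E9.
my $sample = "This job of yours\x{2014}how did you get it? "
           . "By submitting a r\x{E9}sum\x{E9}, right?";

binmode STDOUT, ':encoding(UTF-8)';
for my $ch (non_ascii_chars($sample)) {
    # Print the code point and its official Unicode name.
    printf "U+%04X %s\n", ord $ch, charnames::viacode(ord $ch);
}
```

Run against the full post, the same loop would also be expected to report the curly apostrophe and the horizontal ellipsis, assuming the text is decoded as characters first.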

Re: OSCON Perl Unicode Slides
by davido (Cardinal) on Jul 25, 2011 at 17:08 UTC

    In your first talk's slides make the following changes:

    • use 5.12; should be use 5.012;
    • use 5.14; should be use 5.014;

    I sure wish I were attending just for these talks alone!


    Dave

      davido wrote:
      In your first talk's slides make the following changes:
      • use 5.12; should be use 5.012;
      • use 5.14; should be use 5.014;

      Hm, I don’t think I say that. I’m pretty sure I say:

      =item At the top of your source file (program, module, 
      library, C<do>hickey), prominently assert that you are
      running perl version 5.12 or better via:
      
       use v5.12;  # minimal for unicode_strings feature
       use v5.14;  # optimal for unicode_strings feature
      

      Which seems just fine. Well, or close to it: I’m not too worried about the Perl 5 ⁵ ⁄ ₁₀₀₀ people. I hope that doesn’t seem too 💔 of me.

      I’m trying to inculcate them with the vVERNO style for the express purpose of avoiding just such an error.
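The pitfall being discussed can be illustrated with the core `version` module. This is a small sketch, not taken from the slides: it only shows how the three spellings from this thread are interpreted as version numbers, where a bare decimal such as 5.12 is read in groups of three digits after the point.

```perl
use v5.12;      # also enables "say"
use version;    # core module for parsing version numbers

# Normalize each spelling to dotted form to see what it really means.
for my $spelling ('5.12', '5.012', 'v5.12') {
    say sprintf '%-6s means %s', $spelling, version->parse($spelling)->normal;
}
# 5.12   means v5.120.0   (decimal: ".12" is read as ".120")
# 5.012  means v5.12.0
# v5.12  means v5.12.0
```

Writing the leading `v` makes the number a dotted version string, so `use v5.12;` and `use 5.012;` ask for the same release while a bare `use 5.12;` does not.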

      Thank you for looking through the slides. I did make a couple of minor updates today, and pushed my changes.

      On stage in the morning.

        I'm sure you're correct and I was blind. Does that put me in the 5/1000 category? :)

        Thanks for posting the slides.

        Update: Yes, you're correct. It looks like it will be a nice presentation.


        Dave

        Break a leg!

        Jim

Re: OSCON Perl Unicode Slides
by zentara (Cardinal) on Jul 25, 2011 at 16:45 UTC
    Nice set of material. I like the little camel font you use for representing Perl. :-)

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh

      More to the point, it’s the camel character—the new Unicode emoji character named DROMEDARY CAMEL (U+1F42A)—not just some “little camel font.” Ain’t it cool?
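For anyone who wants to check the character for themselves, here is a minimal sketch. It assumes Perl 5.14 or later, since that release ships the Unicode 6.0 character names that include this one; the variable name `$camel` is just for illustration.

```perl
use v5.14;
use charnames ':full';   # enables \N{NAME} and charnames::viacode

binmode STDOUT, ':encoding(UTF-8)';

# Build the character by name, then look its code point and name back up.
my $camel = "\N{DROMEDARY CAMEL}";
printf "U+%04X %s\n", ord $camel, charnames::viacode(ord $camel);
# prints: U+1F42A DROMEDARY CAMEL
```

Whether the glyph itself shows up on screen still depends on having a font that covers it, such as Symbola.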

Re: OSCON Perl Unicode Slides
by Jim (Curate) on Jul 25, 2011 at 18:37 UTC
Re: OSCON Perl Unicode Slides
by zentara (Cardinal) on Jul 26, 2011 at 20:20 UTC
    Hey, not to pick a nit, but at the top of pue.pdf, you have the date wrong.

    It says "OSCON · Tuesday, 28 July 2011"; the other two PDFs say Thursday, 28.

    I don't quite understand all the unicode but I do know what day of the week it is. ;-)


Re: OSCON Perl Unicode Slides
by pid (Monk) on Jul 28, 2011 at 06:22 UTC

    In short: I simply love your writing style.
    Informative, yet fun to read. That's exactly what a newbie like me wants.

    Thanks!

Re: OSCON Perl Unicode Slides
by BrowserUk (Patriarch) on Jul 25, 2011 at 18:39 UTC

    On page 49, you say:

    "Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong."

    I have a counter proposal.

    Anyone who condemns the collected published works of: Philip Larkin, George Orwell, William Golding, Ted Hughes, Doris Lessing, J. R. R. Tolkien, V. S. Naipaul, Muriel Spark, Kingsley Amis, Angela Carter, C. S. Lewis, Iris Murdoch, Salman Rushdie, Ian Fleming, Jan Morris, Roald Dahl, Anthony Burgess, Mervyn Peake, Martin Amis, Anthony Powell, Alan Sillitoe, John Le Carré, Penelope Fitzgerald, Philippa Pearce, Barbara Pym, Beryl Bainbridge, J. G. Ballard, Alan Garner, Alasdair Gray, John Fowles, Derek Walcott, Kazuo Ishiguro, Anita Brookner, A. S. Byatt, Ian McEwan, Geoffrey Hill, Hanif Kureishi, Iain Banks, George Mackay Brown, A. J. P. Taylor, Isaiah Berlin, J. K. Rowling, Philip Pullman, Julian Barnes, Colin Thubron, Bruce Chatwin, Alice Oswald, Benjamin Zephaniah, Rosemary Sutcliff, Michael Moorcock

    (and that's just the last 70 years), as "stupid, short-sighted, illiterate, broken, evil, and wrong", for the sake of making a puerile argument in favour of their latest, greatest toy -- is a self-absorbed, blinkered, revisionist ....



      One of the authors you listed can't even spell his own name using the ASCII character set.

      I challenge you to find a single, whole published work by any one of these authors that was printed entirely using the ASCII character set.

        I challenge you to find a single, whole published work by any one of these authors that was printed entirely using the ASCII character set.

        And I challenge you to show that none of them have.

        Counter-argument: Check out the 'charset' attribute of the 'Content-Type' meta-tag of the HTML formats of any of the books at Project Gutenberg.

        Prediction: You are going to argue about what constitutes: "published".

        Bottom line: If you were instructed to "resume writing your resume", you would have no trouble in distinguishing la difference. Just as you had no trouble in hearing "dif-er-anse" instead of "diff-rence" as you read the last word of the previous sentence.

        Just as you will have no trouble distinguishing the salient words in:

        1. Messrs Corbin & Son took the lead in the efficient smelting of lead.
        2. He was now so close that he could close the trap with barely a flick of his finger.
        3. Every day, come wind or rain, the old man climbed the steps of the exposed bell-tower to wind the ancient mechanism.
        4. Unable to bear the immense weight of the full-grown grizzly bear standing on his back, he groaned aloud. It was the last sound he would ever make.
        5. It had taken him 3 days to clear the weeds, turn the sod and sow the seed potatoes he'd resisted eating all summer. To see that the pregnant sow had undone all that work in less than an hour was heartbreaking.

        Unicode has its place, but revising history to make a point is stupid. There are good arguments for unicode, but bad arguments are just bad arguments, regardless of the subject.

        Overstating your case diminishes your credibility.

