in reply to Re^5: Special & Accented chars in nodes titles ==> [à la française] (design)
in thread Special & Accented chars in nodes titles ==> [à la française]

Actually, I'm fairly certain that mysql thinks the data is in latin-1. OTOH, it doesn't really matter what mysql thinks, since it only makes a difference when using regexes and case-insensitive LIKE, and to the best of my knowledge, the first is only usable by saints and gods in super search (who could live with it being wrong), and the latter is not used at all.

Of much larger importance is perl itself: the version of perl being used here (5.6.1) has poor utf8 support, and we'd have to figure out how to guess what's being sent, and recode all over the place, to support non-latin-1 content.

If you find a place where the browser appears to send utf8 (not entities) instead of latin-1, please, point me at it! It may be sending %uXXXX URI escapes, which CGI.pm incorrectly interprets. I've been trying to find a test-case for this for some time. (If, OTOH, it really sends utf8, your plan, or at least what I suspect your plan is, is probably the best way to do it.)
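For anyone hunting for that test case, here is a minimal sketch of spotting and decoding %uXXXX escapes (the non-standard form produced by JavaScript's escape()) in a raw query string. It is written in Python purely for illustration, not in the site's Perl, and the function names has_u_escapes and decode_u_escapes are hypothetical. Ordinary %XX escapes are deliberately left untouched.

```python
import re

# Matches the non-standard %uXXXX form: "%u" followed by exactly four hex digits.
_U_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})")

def has_u_escapes(raw: str) -> bool:
    """Return True if the raw (still-escaped) string contains a %uXXXX escape."""
    return _U_ESCAPE.search(raw) is not None

def decode_u_escapes(raw: str) -> str:
    """Replace each %uXXXX with the code point it names; leave %XX escapes alone."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)
```

Logging every request for which has_u_escapes is true would be one way to collect real examples before deciding how to handle them.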

Re^7: Special & Accented chars in nodes titles ==> [à la française] (detecting)
by tye (Sage) on Jun 30, 2004 at 07:57 UTC

    Actually, it does matter in a lot more places and you are correct:

    select 'é' = 'É'

    returns "1". Thanks for the info.

    Anyway, my idea was to update 'startform' so that all of our forms contain a hidden field something like enc="éñÇ" so we could tell if UTF-8 is coming in from a form. Your %u... stuff will catch some other cases. Checking for a charset on Content-Type headers coming *in* might catch more. Probably still not 100% coverage, but pretty good.
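A rough sketch of how the hidden-field check could work, assuming the server sees the raw, URL-decoded bytes of the enc field. It is in Python for illustration only, not the site's Perl, and the names SENTINEL and guess_form_encoding are made up for this example.

```python
from typing import Optional

# The sentinel the form would carry in its hidden "enc" field.
SENTINEL = "éñÇ"

def guess_form_encoding(enc_field: bytes) -> Optional[str]:
    """Compare the received bytes against the sentinel in each candidate encoding."""
    if enc_field == SENTINEL.encode("utf-8"):
        return "utf-8"
    if enc_field == SENTINEL.encode("latin-1"):
        return "latin-1"
    # Field missing or mangled: the caller has to fall back to some other guess.
    return None
```

The sentinel works because each of its characters is one byte in latin-1 but two bytes in UTF-8, so the two encodings can never produce the same byte string.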

    - tye        

      Oooh, that's a much better idea (well, several) than I'd thought of -- I was thinking of checking if it's valid utf8, and if it is, assuming that it was, indeed, UTF8. (This is not as bad as it may appear -- in particular, plain ole ASCII text is valid utf8, and valid latin-1, with exactly the same meaning, so it doesn't matter. Latin-1 that uses high-half characters is unlikely, from a linguistic standpoint, AFAIK, to be valid utf8.)

      I really like the enc="éñÇ" idea, though.
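For comparison, the "if it's valid utf8, assume utf8" fallback described above might look roughly like this. It is a Python sketch for illustration, not the site's Perl, and decode_guess is a hypothetical name. It relies on Python's strict UTF-8 decoder to do the validity check.

```python
from typing import Tuple

def decode_guess(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), preferring UTF-8 and falling back to latin-1."""
    try:
        # Pure ASCII always lands here, with the same meaning in either encoding.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is decodable as latin-1, so this branch cannot fail.
        return data.decode("latin-1"), "latin-1"
```

The weak spot is latin-1 text whose high-half bytes happen to form valid UTF-8 sequences, which would be misread as UTF-8. That case is the unlikely one mentioned above.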