in reply to Re^2: Special & Accented chars in nodes titles ==> [à la française] (!ents)
in thread Special & Accented chars in nodes titles ==> [à la française]

nice to have PerlMonks/Everything encoding behaviour

This is specific to PerlMonks. I don't know what other Everything installations use these days, but PerlMonks used to interpret titles as HTML until I fixed it because it was causing problems and had the potential for even more abuses.

- tye        

Re^4: Special & Accented chars in nodes titles ==> [à la française] (!ents)
by dfaure (Chaplain) on Jun 28, 2004 at 22:08 UTC

    Well, to sum up all that stuff, it seems that PM was initially designed with HTML in mind and then patched several times, ending up supporting latin-1 encoding on input/output but nothing else. Am I right?

    I suspect that PM's storage uses the default table charset (which is latin-1). Am I right again?

    ____
    HTH, Dominique
    My two favorites:
    If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
    Bien faire, et le faire savoir...

      Well, to sum up all that stuff, it seems that PM was initially designed with HTML in mind and then patched several times, ending up supporting latin-1 encoding on input/output but nothing else. Am I right?

      Well, PM was designed with pseudo-HTML in mind and still uses it (but not in node titles). As for the contents of node titles, I find more evidence that the original design was not for them to be interpreted as HTML. I think that they were either designed to be text or that part of the design just wasn't fully specified or fully considered. There were similar parts that should have been escaped and simply broke things in some cases, so I don't think I'm stretching to guess that the titles were not escaped for similar reasons (a very common mistake that I've made many times and have seen others make many times).

      I suspect that PM's storage uses the default table charset (which is latin-1). Am I right again?

      No, the storage of PM nodes is encoding-agnostic, AFAICT. It just stores byte strings without bothering with encodings. And I'm glad.

      BTW, if you look at your node's title, you'll notice that your accented characters are no longer correct. This is due to what I mentioned above; your browser is sending UTF-8 text to PerlMonks. Luckily, this prompted me to realize that there is a simple way that we can detect this. Now I just need to write conversion code (and I think a regex will be easier than porting Encode to PerlMonks, but we'll see).
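
A minimal sketch of what regex-based detection and conversion could look like, without Encode. This is not the code PerlMonks runs; the names `looks_like_utf8`, `utf8_to_latin1` and `_decode_seq` are made up for illustration. It assumes that a string containing high bytes which all form plausible UTF-8 sequences really is UTF-8 (a Latin-1 string such as "Ã©" would fool it), and that characters outside Latin-1 should fall back to numeric entities. The pattern is only roughly strict: it does not reject overlong forms or surrogates.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string has at least one high byte
# and every high byte belongs to a plausible UTF-8 multi-byte sequence.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[\x80-\xFF]/;
    return $bytes =~ /\A(?:
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | [\xE0-\xEF][\x80-\xBF]{2}
        | [\xF0-\xF4][\x80-\xBF]{3}
    )*\z/x ? 1 : 0;
}

# Hypothetical helper: decode one UTF-8 sequence to its code point, then
# emit a Latin-1 byte if it fits, or a numeric entity otherwise.
sub _decode_seq {
    my @b = unpack('C*', $_[0]);
    my $cp = @b == 2 ? (($b[0] & 0x1F) << 6)  |  ($b[1] & 0x3F)
           : @b == 3 ? (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6)
                                              |  ($b[2] & 0x3F)
           :           (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                     | (($b[2] & 0x3F) << 6)  |  ($b[3] & 0x3F);
    return $cp <= 0xFF ? chr($cp) : "&#$cp;";
}

# Convert only when the input looks like UTF-8; otherwise return it as is.
# A single substitution pass avoids re-reading bytes it has just produced.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    return $bytes unless looks_like_utf8($bytes);
    $bytes =~ s/( [\xC2-\xDF][\x80-\xBF]
                | [\xE0-\xEF][\x80-\xBF]{2}
                | [\xF0-\xF4][\x80-\xBF]{3} )/_decode_seq($1)/gex;
    return $bytes;
}
```

With this, `utf8_to_latin1("fran\xC3\xA7aise")` yields the Latin-1 bytes `"fran\xE7aise"`, while input that is already Latin-1 passes through untouched.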

      In the meantime, if you are going to write French at PerlMonks, you'll need to use HTML entities for accented characters in the text and use a different browser to get accented characters in the titles (if this is a big hardship for you, maybe someone will volunteer to clean up your titles for you, though that work may have to be done every time you update a node).
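
For the entity route, a small sketch of turning Latin-1 accented characters into numeric HTML entities before pasting text into a node body. The function name `latin1_to_entities` is hypothetical, and it assumes the input is a Latin-1 byte string (bytes 0x80 to 0x9F are converted blindly here, even though they are control characters in Latin-1).

```perl
use strict;
use warnings;

# Hypothetical helper: replace every high Latin-1 byte with a numeric
# HTML entity, leaving plain ASCII untouched.
sub latin1_to_entities {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $bytes;
}

print latin1_to_entities("\xE0 la fran\xE7aise"), "\n";
```

Numeric entities avoid having to keep a table of named ones such as `&eacute;`.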

      - tye        

        Actually, I'm fairly certain that mysql thinks the data is in latin-1. OTOH, it doesn't really matter what mysql thinks, since it only makes a difference when using regexes and case-insensitive LIKE, and to the best of my knowledge, the first is only usable by saints and gods in super search (who could live with it being wrong), and the latter is not used at all.

        Of much greater importance is perl itself: the version of perl being used here (5.6.1) has poor utf8 support, and we'd have to figure out how to guess what's being sent and recode all over the place to support non-latin-1 content.

        If you find a place where the browser appears to send utf8 (not entities) instead of latin-1, please, point me at it! It may be sending %uXXXX URI escapes, which CGI.pm incorrectly interprets. I've been trying to find a test-case for this for some time. (If, OTOH, it really sends utf8, your plan, or at least what I suspect your plan is, is probably the best way to do it.)
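
To make the `%uXXXX` case concrete, here is a rough sketch of decoding those non-standard escapes by hand alongside ordinary `%XX` escapes. It says nothing about how CGI.pm works internally; `decode_percent_u` and `cp_to_latin1` are hypothetical names, and mapping code points above 0xFF to numeric entities is just one possible policy.

```perl
use strict;
use warnings;

# Hypothetical helper: a code point becomes a Latin-1 byte if it fits,
# otherwise a numeric HTML entity.
sub cp_to_latin1 {
    my ($cp) = @_;
    return $cp <= 0xFF ? chr($cp) : "&#$cp;";
}

# Decode %uXXXX and %XX in a single pass, so that a decoded '%'
# (from %u0025) is never mistaken for the start of another escape.
sub decode_percent_u {
    my ($s) = @_;
    $s =~ s/%(?:u([0-9A-Fa-f]{4})|([0-9A-Fa-f]{2}))/
        defined $1 ? cp_to_latin1(hex($1)) : chr(hex($2))/gex;
    return $s;
}
```

For example, `decode_percent_u("caf%u00E9")` gives the Latin-1 bytes `"caf\xE9"`.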

        It just stores byte strings without bothering with encodings. And I'm glad.

        You should be. IMHO, choosing simplicity is always best.

        Now I just need to write conversion code (and I think a regex will be easier than porting Encode to PerlMonks, but we'll see).

        Be careful not to forget what you're glad for ;-). I.e., you'll need bi-directional encoding [browser (???)] <==> [storage (utf8)] here.
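
A sketch of what "bi-directional" means in the simplest case, assuming Latin-1 on one side and UTF-8 on the other and only characters up to U+00FF. Both function names are hypothetical, and whether PerlMonks would actually store UTF-8 is an assumption of this example, not something established in the thread.

```perl
use strict;
use warnings;

# Latin-1 bytes -> UTF-8 bytes: each high byte becomes a two-byte sequence.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/
        chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F))/gex;
    return $bytes;
}

# UTF-8 bytes -> Latin-1 bytes, only for the two-byte sequences that
# encode U+0080..U+00FF (lead bytes C2 and C3).
sub utf8_low_to_latin1 {
    my ($bytes) = @_;
    $bytes =~ s/([\xC2\xC3])([\x80-\xBF])/
        chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))/gex;
    return $bytes;
}
```

A round trip such as `utf8_low_to_latin1(latin1_to_utf8($title))` should hand back the original Latin-1 bytes, which is the property to check before trusting either direction.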

        you'll need to use HTML entities for accented characters in the text

        ...and Latin-1 for the title. I'll try to use proper encoding!

        ____
        HTH, Dominique
        My two favorites:
        If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
        Bien faire, et le faire savoir...