Well, to sum up, it seems that PM was initially designed with HTML in mind, then patched several times, ending up supporting Latin-1 encoding on input/output but nothing else. Am I right?
Well, PM was designed with pseudo-HTML in mind and still uses it (but not in node titles). As for the contents of node titles, I find more evidence that the original design was not for them to be interpreted as HTML. I think that they were either designed to be text or that that part of the design just wasn't fully specified or fully considered. There were similar parts that should have been escaped and simply broke things in some cases, so I don't think I'm stretching to guess that the titles were not escaped for similar reasons (a very common mistake that I've made many times and have seen others make many times).
I suspect the storage of PM was created with the default table charset (which is Latin-1). Am I right again?
No, the storage of PM nodes is encoding-agnostic, AFAICT. It just stores byte strings without bothering with encodings. And I'm glad.
BTW, if you look at your node's title, you'll notice that your accented characters are no longer correct. This is due to what I mentioned above; your browser is sending UTF-8 text to PerlMonks. Luckily, this prompted me to realize that there is a simple way that we can detect this. Now I just need to write conversion code (and I think a regex will be easier than porting Encode to PerlMonks, but we'll see).
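For what it's worth, the regex approach mentioned above could look something like this minimal sketch (the sub name is hypothetical, and it assumes the input contains only Latin-1 characters, so every multi-byte UTF-8 sequence has a 0xC2 or 0xC3 lead byte):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert a UTF-8 byte string that contains only
# Latin-1 characters back to Latin-1 bytes, using a regex instead of
# porting Encode. This is a sketch, not PerlMonks' actual code.
sub utf8_to_latin1 {
    my ($s) = @_;
    # A Latin-1 character in UTF-8 is a two-byte sequence whose lead
    # byte is 0xC2 or 0xC3; decode the pair back to a single byte.
    $s =~ s{([\xC2\xC3])([\x80-\xBF])}
           { chr( ((ord($1) & 0x1F) << 6) | (ord($2) & 0x3F) ) }ge;
    return $s;
}

# "é" (U+00E9) arrives from a UTF-8 browser as the byte pair 0xC3 0xA9.
print utf8_to_latin1("caf\xC3\xA9"), "\n";  # prints "café" as Latin-1 bytes
```

The same two-byte pattern is also how you could *detect* UTF-8 input: lone 0xC2/0xC3 bytes followed by continuation bytes are vanishingly rare in real Latin-1 text.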
In the meantime, if you are going to write French at PerlMonks, you'll need to use HTML entities for accented characters in the text and use a different browser to get accented characters in the titles (if this is a big hardship for you, maybe someone will volunteer to clean up your titles for you, though that work may have to be done every time you update a node).
| [reply] |
Actually, I'm fairly certain that MySQL thinks the data is in Latin-1. OTOH, it doesn't really matter what MySQL thinks, since it only makes a difference when using regexes and case-insensitive LIKE, and to the best of my knowledge, the former is only usable by saints and gods in Super Search (who could live with it being wrong), and the latter is not used at all.
Of much larger importance is Perl itself: the version of Perl being used here (5.6.1) has poor UTF-8 support, and we'd have to figure out how to guess what's being sent, then recode all over the place to support non-Latin-1 content.
If you find a place where the browser appears to send utf8 (not entities) instead of latin-1, please, point me at it! It may be sending %uXXXX URI escapes, which CGI.pm incorrectly interprets. I've been trying to find a test-case for this for some time. (If, OTOH, it really sends utf8, your plan, or at least what I suspect your plan is, is probably the best way to do it.)
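Undoing those non-standard %uXXXX escapes is mechanical, for what it's worth; a hedged sketch (the sub name is hypothetical, and it maps each escape to a numeric HTML entity rather than to raw bytes):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite non-standard %uXXXX URI escapes (as
# produced by JavaScript's escape()) into numeric HTML entities,
# which the site already knows how to render.
sub decode_u_escapes {
    my ($s) = @_;
    $s =~ s/%u([0-9A-Fa-f]{4})/ '&#' . hex($1) . ';' /ge;
    return $s;
}

print decode_u_escapes('caf%u00E9'), "\n";  # prints "caf&#233;"
```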
| [reply] |
select 'é' = 'É'
returns "1". Thanks for the info.
Anyway, my idea was to update 'startform' so that all of our forms contain a hidden field, something like enc="éñÇ", so we could tell if UTF-8 is coming in from a form. Your %u... stuff will catch some other cases. Checking for Content-Encoding headers coming *in* might catch more. Probably still not 100% coverage, but pretty good.
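That hidden-field trick could be checked server-side along these lines (a sketch assuming the form emits enc="éñÇ" as Latin-1 bytes; the names are made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The probe text "éñÇ" in the two encodings we might receive back.
my $LATIN1_PROBE = "\xE9\xF1\xC7";              # one byte per character
my $UTF8_PROBE   = "\xC3\xA9\xC3\xB1\xC3\x87";  # two bytes per character

# Hypothetical check: given the submitted value of the hidden "enc"
# field, report whether the browser sent the form as UTF-8.
sub form_is_utf8 {
    my ($enc) = @_;
    return 1 if $enc eq $UTF8_PROBE;
    return 0 if $enc eq $LATIN1_PROBE;
    return undef;  # unrecognized; fall back to other heuristics
}
```

The probe field costs nothing when the browser behaves, and a UTF-8 submission doubles each accented byte, so the mismatch is unambiguous.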
| [reply] [d/l] |
It just stores byte strings without bothering with encodings. And I'm glad.
You should be. IMHO, the choice of simplicity is always best.
Now I just need to write conversion code (and I think a regex will be easier than porting Encode to PerlMonks, but we'll see).
Be careful not to forget what you're glad for ;-). I.e., you'll need bi-directional conversion here: [browser (???)] <==> [storage (utf8)].
you'll need to use HTML entities for accented characters in the text
...and Latin-1 for the title. I'll try to use proper encoding!
____
HTH, Dominique
My two favorites:
If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
Bien faire, et le faire savoir... ("Do it well, and make it known...")
| [reply] |