ISO-Latin-1 as node and UTF-8 in frontpage

Looking at two recent frontpaged nodes, I see some ugly critters where some non-ASCII characters should be. The nodes are Seven habits of highly careful coders and Yet Another Perl/PHP/CF/NET Comparison Question, both under the header "New Meditations". The characters are intended to be just curly single and double quotes, but for each one you see two characters instead, which appear to be the bytes representing that character in UTF-8. They look like this on the frontpage:

When he asks Ā“How can I be more careful?Ā”, We usually answer. Ā“That is up to you to figure outĀ” After some thought IĀ’m not sure this is the right approach.
I'm concerned that they claim they can code Perl but havenĀ’t even heard of CPAN.

Now the odd thing is that if you go look on their own node, it looks just fine:

When he asks “How can I be more careful?”, We usually answer. “That is up to you to figure out” After some thought I’m not sure this is the right approach.
I'm concerned that they claim they can code Perl but haven’t even heard of CPAN.

So it looks to me like the data is just fine in the database.

Now, one can only guess what is happening, but a possibility to look into is that a plain ISO-Latin-1 text string is being concatenated with something that Perl has flagged as a UTF-8 string. Whenever that happens, Perl will "promote" the ISO-Latin-1 string to UTF-8, turning each byte with a value >= 128 into two bytes.
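A minimal sketch of that promotion, assuming the node text holds the Windows byte 0x93 and that something containing a character above 255 gets concatenated to it. Both strings are made up for illustration; this is not the site's actual code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical data: a byte string starting with 0x93, the Windows curly open quote.
my $node_text = "\x93Hi";

# Any string holding a character above 255 is stored internally as UTF-8.
my $wide_text = "\x{263A}";

# Concatenation upgrades the byte string: byte 0x93 becomes character U+0093.
my $page = $node_text . $wide_text;

# Writing the result out as UTF-8 turns that one byte into two (C2 93).
my $octets = encode('UTF-8', $page);
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
print "$hex\n";   # C2 93 48 69 E2 98 BA
```

A browser told that the page is ISO-8859-1 would show those two bytes, C2 and 93, as two separate characters.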

A possible fix, which is on the safe side and applicable everywhere, is to turn every non-ASCII character into an entity: either a named entity, for example by using HTML::Entities, or a numerical entity like &#165; (¥), where the number is nothing but the ordinal character code in the Unicode/Latin-1 character set.
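The numerical form can be sketched in core Perl as follows. The helper name `to_numeric_entities` and the sample string are hypothetical, and the sketch assumes the input really is Latin-1 or Unicode text, so that `ord` already yields the right code.

```perl
use strict;
use warnings;

# Replace every character outside the ASCII range with a numeric entity.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

print to_numeric_entities("caf\x{E9} costs \x{A5}100"), "\n";
# caf&#233; costs &#165;100
```

For the named form, `encode_entities` from HTML::Entities does the same job without a hand-written regex.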

N.B. The characters in the above posts are actually not in the ISO-Latin-1 repertoire. They are in the Windows character set (Windows-1252), though, which is ISO-Latin-1 plus a few extra printable characters. So, to follow the rules, a numerical entity for one of them must use its ordinal value in Unicode, not its Windows byte value.
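One way to get the Unicode ordinal is to decode the bytes as Windows-1252 first. This is a sketch under the assumption that the stored text is Windows-1252; `cp1252_to_entities` and the sample bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the bytes as Windows-1252, then emit numeric entities that carry
# Unicode code points instead of the raw Windows byte values.
sub cp1252_to_entities {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);
    $chars =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $chars;
}

print cp1252_to_entities("haven\x92t"), "\n";   # haven&#8217;t
```

Byte 0x92 comes out as &#8217; (U+2019, the right single quote) rather than &#146;, which would name a control character.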

Update: The author of my first example fixed up his node, thereby removing my evidence. :( Well, I found another one here.


Re: ISO-Latin-1 as node and UTF-8 in frontpage (not for me) by tye (Sage) on Sep 14, 2003 at 05:08 UTC
I don't see the problem. I even looked at the Monastery Gates while not logged in, so that I saw the cached copy, and the quotes you mention all show normally for me in every case I tried. You don't mention the browser you are using, but I've seen bugs with Mozilla wanting to do everything in UTF-8 and so making mistakes when given a Latin-1 page like those that PM produces.

One of my favorite tools is:

`perl -S GET -SueUs "http://perlmonks.org/?node=..."`

which you might want to try (it requires that you have the LWP bundle installed, which is a pretty useful thing to have). If you replace -SueUs with -SuedUs, then you get just the headers, no page contents.

That command will show you the raw page contents and the headers, so you can verify whether UTF-8 characters are being sent and whether the headers say Latin-1. When I do that without the "?node=..." part I get the following bits:

`Content-Type: text/html; charset=ISO-8859-1
...
Title: Perl Monks - The Monastery Gates
...
When he asks “How can I be more careful?”, We usually
...
but haven’t even heard of CPAN.`

This includes three slanted quotes as single-byte characters (Latin-1, not UTF-8), which matches the headers.

So I don't see the problem in my browser (IE5) and I don't see it when I bypass my browser. You may have found a bug in your browser.

- tye
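Checking the raw bytes by eye, as described in the reply above, can be backed up with a small heuristic. This is only a sketch: `count_high_quotes` is a hypothetical helper, the sample input is made up, and a genuine Latin-1 "Â" (byte C2) followed by a byte in the 0x80..0x9F range would be miscounted as a promoted character.

```perl
use strict;
use warnings;

# Count how the 0x80..0x9F range shows up in raw page bytes:
#   $as_utf8   - two-byte sequences C2 80 .. C2 9F (promoted characters)
#   $as_single - lone bytes 0x80 .. 0x9F (as a Windows editor would write them)
sub count_high_quotes {
    my ($raw) = @_;
    my $as_utf8 = () = $raw =~ /\xC2[\x80-\x9F]/g;
    (my $rest = $raw) =~ s/\xC2[\x80-\x9F]//g;
    my $as_single = () = $rest =~ /[\x80-\x9F]/g;
    return ($as_utf8, $as_single);
}

my ($utf8, $single) =
    count_high_quotes("asks \xC2\x93How\xC2\x94 and \x93why\x94");
print "promoted: $utf8, single-byte: $single\n";
# promoted: 2, single-byte: 2
```

A page that reports only single-byte hits is being sent as stored; promoted hits point at a UTF-8 upgrade somewhere between the database and the output.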
Re: Re: ISO-Latin-1 as node and UTF-8 in frontpage (not for me) by bart (Canon) on Sep 14, 2003 at 16:42 UTC
You're right that the cause is not a perl problem. Indeed, when I fetch these pages using LWP::Simple and then examine them with a text editor, they look fine (sort of). Yet the site, or rather the authors of these nodes, don't get off completely free.

I wouldn't actually call it a bug in the browser, because the site claims to be emitting ISO-8859-1 text. Well, as mentioned in the root node: these characters are not in this character set. They are in the Windows character set, which has some extra printable characters where ISO-8859-1 has control characters (mirrors of the control characters that have the highest bit cleared). I think it's typical for Microsoft to consider their own extensions as ISO-8859-1... :-) But I expect problems on any other platform or browser. The symptoms will likely not be the same, but the characters will not show up as intended. Nor do they have to.

So I tested it. Every browser I tested it with has problems. These are:

Windows
  Mozilla FireBird: my default browser, the one where I first noticed it
  Mozilla 1.3
  Netscape 7.0
  Netscape 4.8
  MSIE 5.50 (!) Hilarious, really. All (or most) of the above use the Gecko rendering engine, but this one by Microsoft themselves acts in the same way.
  Opera 6.01

Mac Classic (MacOS 9.2.2)
  MSIE 5.1: Yup, here too.
  Netscape 7.02
  iCab 2.9.1

Linux (console only, as I don't have a GUI)
  Lynx

I most definitely expected trouble on Linux, but you might be particularly interested in how this browser displayed it:

`I work with a person who is often NOT careful in his work. Beyond that he does well. When he asks ĀHow can I be more careful?, We usually answer.
That is up to you to figure outĀ After some thought IĀm not sure this is the right approach.`

You suggest this is likely a bug in Mozilla. The fact that the same text shows up differently on different nodes is the only thing that I would qualify as a bug, but it's quite striking that virtually all these browsers display these characters in almost identical ways: as two characters each.

Now, solutions? Like I said, the cause of the problem is people entering characters from this Windows extended set, but the site (which isn't really to blame, except maybe for accepting them) might remedy that. The simple approach is to replace these curly quotes with the plain ASCII quotes. A bit more advanced would be to use HTML entities. So this site could help careless authors a little by replacing these, and only these, characters (ord range = 128 .. 159).

Update: I've been told the same thing happens in the Safari browser on MacOSX.
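The last suggestion, touching only the characters in the range 128 .. 159, might look like the sketch below. The lookup table is deliberately partial and `fix_windows_bytes` is a hypothetical name; nothing here is taken from the site's code.

```perl
use strict;
use warnings;

# Partial, illustrative table: Windows-1252 byte => Unicode code point.
my %cp1252_punct = (
    0x85 => 0x2026,   # horizontal ellipsis
    0x91 => 0x2018,   # left single quote
    0x92 => 0x2019,   # right single quote
    0x93 => 0x201C,   # left double quote
    0x94 => 0x201D,   # right double quote
    0x96 => 0x2013,   # en dash
    0x97 => 0x2014,   # em dash
);

# Rewrite only bytes 128..159 that the table knows; leave all else untouched,
# including ordinary Latin-1 characters such as 0xE9.
sub fix_windows_bytes {
    my ($text) = @_;
    $text =~ s{([\x80-\x9F])}{
        my $cp = $cp1252_punct{ ord($1) };
        defined $cp ? sprintf('&#%d;', $cp) : $1;
    }ge;
    return $text;
}

print fix_windows_bytes("I\x92m not sure"), "\n";   # I&#8217;m not sure
```

Swapping the entity for a plain ASCII quote in the table would give the simpler of the two remedies mentioned above.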
Re: Re: ISO-Latin-1 as node and UTF-8 in frontpage (not for me) by allolex (Curate) on Sep 14, 2003 at 10:51 UTC
I'm seeing the problem in Mozilla 1.4, Konqueror 3.1.3, and (e)links. The links output shows asterisks for the formatted quotes, and lynx doesn't show them at all. Like you said, IE 6 (using Crossover, BTW) does not have a problem rendering them.

-- Allolex
Re: ISO-Latin-1 as node and UTF-8 in frontpage by dws (Chancellor) on Sep 14, 2003 at 16:34 UTC
"Looking at two recent frontpaged nodes, I see some ugly critters there, where some non-Ascii characters should be."

After looking at Seven habits of highly careful coders directly and via the Front Page, and examining the results in hex, I was about to post a reply agreeing with tye. All I saw was the 0x93 Latin-1 open double-quote. Then I looked again and saw the accented A.

Has someone been editing that node? If not, I can confirm the mixed behavior on WinXP/IE6.
Re: ISO-Latin-1 as node and UTF-8 in frontpage by mandog (Curate) on Sep 14, 2003 at 22:50 UTC
I originally edited Seven habits without putting in an "Updated:" note. This is probably why the non-ASCII characters were there sometimes and not others.

email: mandog
Re: ISO-Latin-1 as node and UTF-8 in frontpage by castaway (Parson) on Sep 16, 2003 at 07:31 UTC
Just for info, that Seven habits of highly careful coders node still looks strange. I just tried looking at it with most of Opera's manual encoding settings (UTF-8, Windows-1252, etc.), and none of them showed it without strange characters. Most showed that A character; UTF-8 just showed squares where the " are supposed to be. (Opera 7.20 beta for Windows, btw)

C.