in reply to ISO-Latin-1 as node and UTF-8 in frontpage

I don't see the problem. I even looked at the Monastery Gates while not logged in, so that I saw the cached copy, and the quotes you mention all display normally for me in every case I tried.

You don't mention the browser you are using but I've seen bugs with Mozilla wanting to do everything in UTF-8 and so making mistakes when given a Latin-1 page like those that PM produces.

One of my favorite tools is:

perl -S GET -SueUs "http://perlmonks.org/?node=..."
which you might want to try (requires that you have the LWP bundle installed, which is a pretty useful thing to have). If you replace -SueUs with -SuedUs, then you get *just* the headers, no page contents.

That command will show you the raw page contents and the headers so you can verify whether UTF-8 characters are being sent and whether headers say Latin-1.

When I do that without the "?node=..." part I get the following bits:

Content-Type: text/html; charset=ISO-8859-1
...
Title: Perl Monks - The Monastery Gates
...
When he asks “How can I be more careful?”, We usually
...
but haven’t even heard of CPAN.
Which includes three slanted quotes as single-byte characters (Latin-1 not UTF-8) which matches the headers.
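If you'd rather not eyeball the bytes, here is a minimal sketch of the same check in Perl. The helper name classify_bytes and its labels are made up for illustration, and it is only a heuristic: it assumes you hand it the raw, undecoded body (for example, saved from the GET output), and a short single-byte text could in principle also happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of LWP or PM): guess how the high-bit
# characters in a raw byte string are encoded.
sub classify_bytes {
    my ($bytes) = @_;

    # No byte above 0x7F: plain ASCII, nothing to decide.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Strict UTF-8 decode. decode() with a CHECK value may modify its
    # argument, so work on a copy.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return 'utf-8' if $ok;

    # Not UTF-8, so single-byte. Bytes 0x80-0x9F are control characters
    # in ISO-8859-1 but printable characters in the Windows code page.
    return $bytes =~ /[\x80-\x9F]/ ? 'windows-range' : 'latin-1';
}

print classify_bytes("asks \x93How can I be more careful?\x94"), "\n";
```

A result of 'windows-range' means the quotes went out as single bytes in the 0x80-0x9F range, whatever the charset header says.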

So I don't see the problem in my browser (IE5) and I don't see it when I bypass my browser. You may have found a bug in your browser.

                - tye

Re: Re: ISO-Latin-1 as node and UTF-8 in frontpage (not for me)
by bart (Canon) on Sep 14, 2003 at 16:42 UTC
    You're right that the cause is not a perl problem. Indeed, when I get these pages using LWP::Simple and then examine them with a text editor, they look fine (sort of). Yet the site, or rather the authors of these nodes, don't get off completely free.

    I wouldn't actually call it a bug in the browser, because the site claims to be emitting ISO-8859-1 text. Well, as mentioned in the root node: these characters are not in this character set. They are in the Windows character set (Windows-1252), which is ISO-8859-1 plus some extra printable characters in the range where ISO-8859-1 has control characters (mirrors of the ASCII control characters, but with the highest bit set). I think it's typical for Microsoft to consider their own extensions as ISO-8859-1... :-) but I expect problems on any other platform or browser. The symptoms will likely not be the same, but the characters will not show up as intended. They need not.
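    To make the difference between the two character sets concrete, here is a small sketch using the core Encode module. It assumes the three quotes in the page source are the bytes 0x92, 0x93 and 0x94 (I'm inferring that from how they display, not from a hex dump), and the helper name codepoints is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the same bytes under a given charset and list the resulting
# Unicode code points.
sub codepoints {
    my ($charset, $bytes) = @_;
    return map { sprintf 'U+%04X', ord($_) } split //, decode($charset, $bytes);
}

my $bytes = "\x92\x93\x94";   # assumed bytes behind the three curly quotes

# ISO-8859-1 yields control characters: U+0092 U+0093 U+0094
print "ISO-8859-1: @{[ codepoints('ISO-8859-1', $bytes) ]}\n";

# cp1252 yields the curly quotes: U+2019 U+201C U+201D
print "cp1252:     @{[ codepoints('cp1252', $bytes) ]}\n";
```

    So a browser that takes the charset header literally gets three control characters, and only one that quietly treats ISO-8859-1 as the Windows set gets quotes.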

    So I tested it. Every browser I tested it with has problems. These are:

    • Windows
      • Mozilla Firebird: my default browser, the one where I first noticed it
      • Mozilla 1.3
      • Netscape 7.0
      • Netscape 4.8
      • MSIE 5.50 (!) Hilarious, really. All (or most) of the above use the Gecko rendering engine, but this one by Microsoft themselves acts in the same way.
      • Opera 6.01
    • Mac Classic (MacOS 9.2.2)
      • MSIE 5.1 — Yup, here too.
      • Netscape 7.02
      • iCab 2.9.1
    • Linux (console only, as I don't have a GUI)
      • Lynx
        I most definitely expected trouble on Linux, but you might be particularly interested in how this browser displayed it:
        I work with a person who is often NOT careful in his work. Beyond that he does well. When he asks ÂHow can I be more careful?, We usually answer. That is up to you to figure out After some thought IÂm not sure this is the right approach.

    You point out this is likely a bug in Mozilla. The only thing that I would qualify as a bug is the fact that the same text shows up differently on the different nodes. But it's quite striking that virtually all these browsers display these characters in almost identical ways: as two characters each.
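    One plausible explanation for the "two characters each" (this is my guess, not something I have verified for every browser above): somewhere along the way the byte is taken as the ISO-8859-1 control character and re-encoded as UTF-8, which turns it into two bytes. A sketch, with a made-up helper name:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read bytes as ISO-8859-1, write them back out as
# UTF-8, and return the resulting bytes in hex.
sub latin1_to_utf8_hex {
    my ($bytes) = @_;
    my $utf8 = encode('UTF-8', decode('ISO-8859-1', $bytes));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $utf8;
}

print latin1_to_utf8_hex("\x93"), "\n";   # prints "C2 93"
```

    Byte 0xC2 is "Â" in ISO-8859-1 and 0x93 is an invisible control character, which would match the "ÂHow" that Lynx showed.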

    Now, solutions? Like I said, the cause of the problem is people entering characters from this Windows extended set, but the site (which isn't really to blame, except maybe for accepting them) might remedy that. The simple approach is to replace these curly quotes with the plain ASCII quotes. A bit more advanced would be to use HTML entities. So this site could help careless authors a little by replacing these, and only these, characters (ord range = 128 .. 159).
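    A minimal sketch of the entity approach, assuming the node text arrives as single-byte data and that bytes 128..159 are meant as Windows characters. The function names are hypothetical and this is not how the site's code actually works; it only illustrates the idea.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map one byte in 0x80-0x9F to a numeric HTML
# entity via its cp1252 meaning. Bytes that cp1252 leaves unassigned
# are returned unchanged.
sub _entity_for {
    my ($byte) = @_;
    my $cp = ord(decode('cp1252', $byte));
    return $byte if $cp <= 0x9F || $cp == 0xFFFD;
    return sprintf '&#%d;', $cp;
}

# Replace these, and only these, characters (ord range 128 .. 159).
sub fix_windows_chars {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\x9F])/_entity_for($1)/ge;
    return $bytes;
}

print fix_windows_chars("asks \x93How can I be more careful?\x94"), "\n";
# prints: asks &#8220;How can I be more careful?&#8221;
```

    Real ISO-8859-1 characters (160 .. 255) pass through untouched, so properly accented text is not affected.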

    Update: I've been told the same thing happens on the Safari browser on MacOSX.

Re: Re: ISO-Latin-1 as node and UTF-8 in frontpage (not for me)
by allolex (Curate) on Sep 14, 2003 at 10:51 UTC

    I'm seeing the problem in Mozilla 1.4 and Konqueror 3.1.3, and (e)links. The links output shows asterisks for the formatted quotes and lynx doesn't show them at all. Like you said, IE 6 (using Crossover, BTW) does not have a problem rendering them.

    --
    Allolex