Looking at two recent frontpaged nodes, I see some ugly critters where some non-ASCII characters should be. The nodes are Seven habits of highly careful coders and Yet Another Perl/PHP/CF/NET Comparison Question, both under the header "New Meditations".

The characters are intended to be just curly single and double quotes, but on the front page each one shows up as two characters, which appear to be the bytes of a UTF-8 encoding.

Now the odd thing is that if you go look at each node on its own page, it looks just fine. So it looks to me like the data is just fine in the database.

Now, one can only guess what is happening, but one possibility to look into is that a plain ISO-Latin-1 byte string is being concatenated with something that Perl has flagged as a UTF-8 string. Whenever that happens, Perl will "promote" the ISO-Latin-1 string to UTF-8, turning each byte with a value >= 128 into two bytes.
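To make that guess concrete, here is a minimal sketch of the promotion, not a diagnosis of the actual site code. The variable names and the sample strings are made up for illustration; the only assumption is that a byte string containing a Windows curly quote (byte 0x93) meets a string that Perl already holds as characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical byte string holding one Windows curly quote (byte 0x93).
my $bytes = "\x{93}quoted";

# Hypothetical character string; it contains a code point above 255,
# so Perl flags it as UTF-8 internally.
my $chars = "\x{263A} ";

# Concatenation upgrades $bytes: byte 0x93 is now treated as code point U+0093.
my $joined = $chars . $bytes;

# Encoding the result for output turns U+0093 into the two bytes C2 93.
my $out = encode('UTF-8', $joined);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $out;
print "$hex\n";    # the dump contains "C2 93" where the single byte 93 used to be
```

A browser that receives those two bytes where it expected one would show two junk characters per quote, which matches what the front page displays.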

A possible fix, which is on the safe side and applicable everywhere, is to turn every non-ASCII character into an entity: either a named entity, for instance by using HTML::Entities, or a numerical entity like &#165; (which renders as ¥), where the number is nothing but the character's ordinal code in the Unicode/Latin-1 character set.
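As a rough sketch of the numerical-entity variant, a single substitution is enough. The helper name to_numeric_entities is hypothetical, and the sketch assumes the input is either Latin-1 bytes or an already decoded character string, so that ord() yields the Unicode code point.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above 127 with a decimal
# numeric entity. Assumes ord() of each character is its Unicode code point.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print to_numeric_entities("price: \x{A5}100"), "\n";    # price: &#165;100
```

The result is pure ASCII, so it survives any later concatenation or promotion unchanged.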

n.b. The characters in the above posts are actually not in the ISO-Latin-1 repertoire. They are in the Windows character set (Windows-1252), though, which is ISO-Latin-1 plus a few extra printable characters. So in order to follow the rules, a numerical entity should use the character's ordinal value in Unicode, not its Windows byte value. For example, the left curly double quote is byte 147 in the Windows set, but its entity should be &#8220;, not &#147;.
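A sketch of that remapping, assuming the stored data really is Windows-1252 bytes (which is my guess, not something I have verified): decode the bytes with the core Encode module first, then emit entities from the resulting Unicode code points. The helper name cp1252_to_entities is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: interpret the input as Windows-1252 bytes, then
# replace every non-ASCII character with its Unicode numeric entity.
sub cp1252_to_entities {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

print cp1252_to_entities("\x{93}hi\x{94}"), "\n";    # &#8220;hi&#8221;
```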

Update: So the author of my first example fixed up his node, thereby removing my evidence. :( Well, I found another one here.


In reply to ISO-Latin-1 as node and UTF-8 in frontpage by bart
