
Re^4: BUG: code blocks don't retain literal formatting -- could they?

by perl-diddler (Chaplain)
on Sep 16, 2016 at 08:15 UTC (#1171919)

in reply to Re^3: BUG: code blocks don't retain literal formatting -- could they?
in thread BUG: code blocks don't retain literal formatting -- could they?

> the Perl5 compiler expects the source code to be 8 bit ANSI characters. [1]

There is no such thing as 8-bit ANSI; the ANSI character set (ASCII) is only 7 bits. Perhaps you mean the western-euro-centric ISO-8859 character set, which is 8-bit? The first 256 Unicode code points are the same as the characters in ISO-8859-1. However, in the UTF-8 encoding of Unicode, each of the upper 128 takes 2 bytes to express (a lead byte of 0xC2 or 0xC3 followed by one continuation byte).

> [2] The open question is, when use feature 'unicode_strings'; is in effect, would "\x80\x77" be interpreted as 2 characters ("\x80" "\x77") or 1 ("\N{U+8077}") ?

They'll be treated as 2 characters, all the time. On output, however, the \x80 will generate a warning as perl converts it to UTF-8 (encoded as \xC2\x80). The \x77 remains \x77 because it is not above \x7f.
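
To make the character count concrete, here is a minimal sketch. It only shows how many characters the literal holds and what an explicit encode to UTF-8 produces; it does not try to reproduce any warning, and the variable names are just for illustration.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(encode);

# Two \x escapes with values below 0x100 give two characters.
my $str        = "\x80\x77";
my $char_count = length $str;                          # 2

# Explicitly encoding to UTF-8 turns U+0080 into two bytes,
# while U+0077 ('w') stays a single byte.
my $utf8_hex = unpack 'H*', encode('UTF-8', $str);     # c28077

# The single character U+8077 has to be asked for with braces.
my $one_char = "\x{8077}";
my $one_len  = length $one_char;                       # 1

printf "%d chars, UTF-8 bytes: %s\n", $char_count, $utf8_hex;
```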

If you want it to remain binary data on output, you must tell perl not to convert it. On input, perl assumes the byte values 0x80-0xFF refer to character identities that coincidentally have the same meaning as the Unicode character with the same value (U+0080 - U+00FF).

That's the Perl-Unicode bug. Perl is not round-trip safe by default. If you wanted it in binary (as indicated by the fact that it's not encoded properly for UTF-8, but is for ISO-8859), perl will still convert it to UTF-8 for you on output and generate a run-time warning about "wide characters" in output.

If you meant for it to be valid UTF-8 encoded Unicode, you would have encoded it as such (with a 0xC2 or 0xC3 lead byte in front of each character over 0x7F). However, if you do, you will still get an error, as perl treats valid (but not labeled) UTF-8 on input as *BINARY*, and your code will see 2 values for each single Unicode character -- the lead byte and the 2nd byte.
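
A small sketch of the "2 values for each single character" effect, using a byte string in place of unlabeled input. The pound sign U+00A3 is just an example I picked because its UTF-8 form starts with 0xC2.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes UTF-8 uses for U+00A3 (pound sign), as they would
# arrive from an input stream that was never labeled as UTF-8.
my $raw = "\xC2\xA3";

# Undecoded, Perl sees two separate values: 0xC2 and 0xA3.
my $undecoded_len = length $raw;                    # 2

# Only after an explicit decode is it one character.
my $text        = decode('UTF-8', $raw);
my $decoded_len = length $text;                     # 1
my $code_point  = sprintf 'U+%04X', ord $text;      # U+00A3
```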

So if you don't label your input, and you use character values >0x7F and <0x100, you will get wrong behavior out of your program -- either on output (if you intended binary), or on input if you encoded using UTF-8.

More than one application uses a heuristic to avoid maximum harm to the user -- i.e. if 0xC2 (or 0xC3) is detected before a byte in the range 0x7F < CHAR <= 0xFF, then assume input is UTF-8, else if CHAR > 0x7F, assume binary was intended. It isn't perfect, as 0xC2 followed by another character in the >0x7F zone can occur in binary data, but it is statistically unlikely, and good enough for most users who are oblivious to the need to label their I/O streams.
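
Roughly, such a heuristic could look like the sketch below. This is not taken from any particular application; guess_encoding is a made-up name, and I've assumed that only the two-byte sequences for U+0080..U+00FF matter (lead byte 0xC2 or 0xC3, continuation byte 0x80..0xBF).

```perl
use strict;
use warnings;

# Hypothetical helper: guess how an unlabeled byte string was meant.
# Returns 'ascii', 'utf8', or 'binary'.
sub guess_encoding {
    my ($bytes) = @_;

    # No high bytes at all: nothing to decide.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Strip every well-formed two-byte sequence for U+0080..U+00FF.
    (my $rest = $bytes) =~ s/[\xC2\xC3][\x80-\xBF]//g;

    # Any high byte left over was not part of such a sequence,
    # so assume the data was meant as binary (or ISO-8859).
    return $rest =~ /[\x80-\xFF]/ ? 'binary' : 'utf8';
}

print guess_encoding("caf\xC3\xA9"), "\n";   # two-byte sequence: utf8
print guess_encoding("caf\xE9"), "\n";       # lone high byte: binary
```

As the post says, it can guess wrong on binary data that happens to contain such a pair.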

Of course you would only engage such heuristics when using stream I/O on STDIN/OUT/ERR. Files opened with "open" would always be interpreted as binary unless specified otherwise.
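
For the "specified otherwise" case, here is a sketch of labeling a handle with an I/O layer. It reads from in-memory handles so it runs without a real file; with a real file you would pass a filename instead of a scalar reference, and binmode can do the same for STDIN/STDOUT.

```perl
use strict;
use warnings;

my $bytes = "\xC2\xA3\n";   # UTF-8 encoding of U+00A3, plus a newline

# Read the same bytes twice: once as raw bytes, once through an
# explicit UTF-8 decoding layer.
open my $raw_fh, '<:raw', \$bytes or die "open: $!";
my $raw_line = <$raw_fh>;
close $raw_fh;

open my $utf8_fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $utf8_line = <$utf8_fh>;
close $utf8_fh;

chomp($raw_line, $utf8_line);
my $raw_len  = length $raw_line;    # 2: the bytes 0xC2 and 0xA3
my $utf8_len = length $utf8_line;   # 1: the character U+00A3
```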

However, due to some zeal to go Unicode in 5.8.0, all files got interpreted as UTF-8 if your locale specified UTF-8 to be used for encoding. That caused a kneejerk reaction to revert to the "Perl-Unicode" bug to cause errors & warnings where stream and/or file labeling wasn't used.

The perl situation is completely different than the HTML problem -- in that HTML5 already specifies the default character set as UTF-8, while older sites using HTML4 may still be interpreted as ISO-8859, even though Unicode has been out for over 20 years. Sigh...

Re^5: BUG: code blocks don't retain literal formatting -- could they?
by RonW (Parson) on Sep 16, 2016 at 17:49 UTC

    My point was simply to suggest alternate fixes to the Perl Monks website.

    Obviously, the best is to never mess with what's between code tags.

    But, this would require PM to send proper, UTF8 encoded response content back to browsers.

    There may be a technical reason why the PM website can't do that. Possible work-arounds to that include (but are not limited to):

    • Save the content between code tags as-is, only applying entity encoding when generating HTML. Then download links would provide the code content as-is using "Content-type: application/octet-stream".
    • For content between code tags, use "\x" encoding instead of entity encoding. Since (at least for now), non-7-bit-characters are most likely to occur in quoted strings, Perl itself would be able to decode the characters that appear in quoted strings. (Of course, if they are in the actual source code, either entity or \x encoding will make a mess.)
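
    The second work-around might be sketched like this. escape_non_ascii is a hypothetical name, and this is only one way the "\x" encoding could be done, not how PM does or would do it.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: replace each character outside 7-bit ASCII
    # with a Perl \x{...} escape and leave everything else untouched.
    sub escape_non_ascii {
        my ($text) = @_;
        $text =~ s/([^\x00-\x7F])/sprintf '\\x{%X}', ord $1/ge;
        return $text;
    }

    # A line of posted code containing U+00E9 and U+263A.
    my $escaped = escape_non_ascii(qq{print "caf\x{E9} \x{263A}";});
    print "$escaped\n";
    ```

    Inside a double-quoted string Perl turns the escapes back into the original characters; anywhere else in the source they would just be noise, which is the caveat above.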

    Again, these are just alternatives to the proper solution. It would be great if PM is able to properly support UTF8 content. We may have to live with a workaround.

      But, this would require PM to send proper, UTF8 encoded response content back to browsers.
      Why? It works now without any extra work in normal text. The only problem is in the CODE blocks, BECAUSE, something reformats input into HTML-entities.

      To fix that, I'd first try not doing that conversion in a code block (and maybe not in text areas). I seem to remember that the HTML entities were provided to allow having "special chars" (special to HTML syntax, like "<" and "&", etc.). But characters above U+007F shouldn't be a problem if they were left "untouched". To handle display of "special chars" in any of the input -- only convert them to HTML-entities on post (if necessary). I'd bet that anything above the normal ASCII range would be fine to leave untouched.
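
      Something along these lines, as a sketch only. escape_html_minimal and the %entity table are names I made up; the point is just that the HTML-special characters get converted and everything else passes through.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: escape only the characters that are special
      # to HTML syntax; non-ASCII characters are left exactly as they are.
      my %entity = (
          '&' => '&amp;',
          '<' => '&lt;',
          '>' => '&gt;',
          '"' => '&quot;',
      );

      sub escape_html_minimal {
          my ($text) = @_;
          $text =~ s/([&<>"])/$entity{$1}/g;
          return $text;
      }

      # U+263A stays a single character; only '<' and '&' are converted.
      my $out = escape_html_minimal("1 < 2 && \x{263A}");
      ```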

        But, this would require PM to send proper, UTF8 encoded response content back to browsers.

        So the Content-type: header will have the correct charset= and encoding= attributes.
