PerlMonks  

Quoting [1]: the Perl5 compiler expects the source code to be 8 bit ANSI characters.

There is no such thing as 8-bit ANSI. The ANSI character standard (ASCII) is only 7 bits. Perhaps you mean the western-euro-centric ISO-8859-1 character set, which is 8-bit? The first 256 Unicode code points are the same as the characters in ISO-8859-1; however, in the UTF-8 encoding of Unicode, the upper 128 of them (U+0080 - U+00FF) each take 2 bytes to express (a lead byte of 0xC2 or 0xC3 followed by a continuation byte).
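
To make the 2-byte claim concrete, here is a minimal sketch that uses the core Encode module to print the UTF-8 bytes for a few code points on either side of the 0x80 boundary. The helper name utf8_hex is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of one code point
# as an uppercase, space-separated hex string.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# 0x7F stays one byte; 0x80-0xBF get a 0xC2 lead byte; 0xC0-0xFF get 0xC3.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x7F, 0x80, 0xBF, 0xC0, 0xFF;
```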
Quoting [2]: The open question is, when use feature 'unicode_strings'; is in effect, would "\x80\x77" be interpreted as 2 characters ("\x80" "\x77") or 1 ("\N{U+8077}") ?

They'll be treated as 2 characters, all the time. On output, however, the \x80 will generate a warning as perl converts it to UTF-8 (encoded as \xC2\x80). The \x77 remains \x77 because it is not above \x7f.
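
A small sketch of the character count and of the UTF-8 form mentioned above. It assumes perl 5.12 or later for the feature pragma, and it calls Encode explicitly as a stand-in for a UTF-8 output handle; it does not exercise perl's default output path or its warnings.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumes perl 5.12+
use Encode qw(encode);

my $string = "\x80\x77";

# Two \x escapes below 0x100 give two characters, not U+8077.
my $char_count = length $string;

# Explicit encoding, standing in for a handle labeled as UTF-8.
my $utf8_bytes = encode('UTF-8', $string);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

print "characters: $char_count\n";   # characters: 2
print "UTF-8 bytes: $hex\n";         # UTF-8 bytes: C2 80 77
```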

If you want it to remain binary data on output, you must tell perl not to convert it. On input, perl assumes the byte values 0x80-0xFF refer to character identities that coincidentally have the same meaning as the Unicode character with the same value (U+0080 - U+00FF).
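
One way to "tell perl" is to label the handle with a PerlIO layer. The sketch below writes the same two characters through a :raw handle and through an :encoding(UTF-8) handle, using in-memory files so the resulting bytes can be inspected. The helper name bytes_written is an illustration, not anything from the post.

```perl
use strict;
use warnings;

my $data = "\x80\x77";

# Hypothetical helper: write $data through the given PerlIO layer into an
# in-memory buffer and return the bytes that reached it, as hex.
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $data;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

print "raw:   ", bytes_written(':raw'), "\n";              # binary kept as-is
print "utf-8: ", bytes_written(':encoding(UTF-8)'), "\n";  # 0x80 becomes C2 80
```

The same layers can be applied to an already open stream with binmode, for example binmode(STDOUT, ':encoding(UTF-8)').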

That's the Perl-Unicode bug. Perl is not round-trip safe by default. If you wanted it in binary (as indicated by the fact that it's not encoded properly for UTF-8, but is for ISO-8859), perl will still convert it to UTF-8 for you on output and generate a run-time warning about "wide characters" in output.

If you meant for it to be valid UTF-8-encoded Unicode, you would have encoded it as such (with a 0xC2 or 0xC3 lead byte in front of each character over 0x7F and below 0x100). However, if you do, you will still get an error, as perl treats valid (but not labeled) UTF-8 on input as *BINARY*, and your code will see 2 values for each single Unicode character -- the lead byte and the continuation byte.
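
The "2 values for each character" effect can be shown without any I/O. In this sketch the byte string stands in for unlabeled input, and an explicit decode from the core Encode module stands in for labeling the stream.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+00A9 (copyright sign) as it sits in a UTF-8 file: 0xC2 0xA9.
my $unlabeled = "\xC2\xA9";

# Without a label, the program sees two separate values.
my @seen = map { sprintf '0x%02X', ord } split //, $unlabeled;

# Labeling the input, here by decoding explicitly, gives one character.
my $decoded = decode('UTF-8', $unlabeled);

printf "unlabeled: %d values (%s)\n", scalar(@seen), join(', ', @seen);
printf "decoded:   %d character (U+%04X)\n", length($decoded), ord($decoded);
```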

So if you don't label your input, and you use character values >0x7F and <0x100, you will get wrong behavior out of your program -- either on output (if you intended binary), or on input if you encoded using UTF-8.

More than one application uses a heuristic to limit the harm to the user -- i.e. if 0xC2 (or 0xC3) is detected before a byte in the range 0x80 <= CHAR <= 0xBF, then assume input is UTF-8, else if CHAR > 0x7F, assume binary was intended. It isn't perfect, as such a lead byte followed by another character in the >0x7F zone can occur in binary data, but it is statistically unlikely, and good enough for most users who are oblivious to the need to label their I/O streams.
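
A rough sketch of such a heuristic, not taken from any particular application. The function name guess_labeling and its three return values are made up for illustration, and accepting 0xC3 as a lead byte alongside 0xC2 is an assumption added so that U+00C0 - U+00FF are covered.

```perl
use strict;
use warnings;

# Hypothetical helper: guess how an unlabeled byte string was meant.
sub guess_labeling {
    my ($bytes) = @_;

    # No byte above 0x7F: nothing to decide.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Every high byte belongs to a lead byte (0xC2 or 0xC3) followed by a
    # continuation byte (0x80-0xBF): assume UTF-8 was intended.
    return 'utf-8'
        if $bytes =~ /\A(?:[\x00-\x7F]|[\xC2\xC3][\x80-\xBF])*\z/;

    # Any other high byte: assume binary was intended.
    return 'binary';
}

print guess_labeling("\xC2\x80\x77"), "\n";   # utf-8
print guess_labeling("\x80\x77"), "\n";       # binary
```

As the post says, a binary stream can still happen to contain such a pair, so this only reduces the damage; it does not replace labeling.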

Of course you would only engage such heuristics when using stream I/O on STDIN/OUT/ERR. Files opened with "open" would always be interpreted as binary unless specified otherwise.

However, due to some zeal to go Unicode in 5.8.0, all files got interpreted as UTF-8 if your locale specified UTF-8 to be used for encoding. That caused a knee-jerk reaction to revert to the "Perl-Unicode" bug, causing errors & warnings wherever stream and/or file labeling wasn't used.

The perl situation is completely different than the HTML problem -- in that HTML5 already specifies the default character set as UTF-8, while older sites using HTML4 may still be interpreted as ISO-8859, even though Unicode has been out for over 20 years. Sigh...


In reply to Re^4: BUG: code blocks don't retain literal formatting -- could they? by perl-diddler
in thread BUG: code blocks don't retain literal formatting -- could they? by perl-diddler
