PerlMonks  

Re: BUG: code blocks don't retain literal formatting -- could they?

by kcott (Archbishop)
on Sep 15, 2016 at 07:38 UTC


in reply to BUG: code blocks don't retain literal formatting -- could they?

G'day perl-diddler,

Regarding what it says below the input window, maybe changing "... put the characters ..." to "... put these characters ..." would clarify which characters this statement references.

The reason why things are the way they are, is to allow code like this:

sub amp { ... } my $coderef = \&amp;

Unfortunately, your request would render that as:

sub amp { ... } my $coderef = \&
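To make the effect concrete, here is a minimal sketch of what entity decoding does to that line. It is not PerlMonks' actual rendering code (which isn't shown in this thread); `decode_basic_entities` is a hypothetical helper that only handles a handful of named entities.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes a few named entities, roughly what happens
# to text that is not protected from entity interpretation.
sub decode_basic_entities {
    my ($text) = @_;
    my %char_for = (amp => '&', lt => '<', gt => '>', quot => '"');
    $text =~ s/&(amp|lt|gt|quot);/$char_for{$1}/g;
    return $text;
}

# The source as typed: a reference to the sub named 'amp'.
my $typed = 'sub amp { ... } my $coderef = \&amp;';

# After decoding, the sub name has been swallowed by the entity.
print decode_basic_entities($typed), "\n";
```

The printed line ends in `\&`, so the reference to `amp` is lost.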

The workaround is to use 'pre' tags instead of 'code' tags for blocks (and 'tt' tags for inline text):

This pi (π) using π does display correctly.
This pi (π) using π does display correctly.
This pi (π) using π does display correctly.
This pi (π) using literal pi character does display correctly.

[Note: On previewing, I noticed that the literal pi character that I pasted into that last example now appears as &#960; in the textarea.]

Use this workaround sparingly as you don't get a [download] link. Also, line wrapping (or absence thereof) can be problematic so aim to keep lines short (I think <= 72 characters is optimal, inasmuch as it doesn't mess up normal page layout).

— Ken

Re^2: BUG: code blocks don't retain literal formatting -- could they?
by perl-diddler (Chaplain) on Sep 15, 2016 at 08:37 UTC
    Regarding the rendering problem: my 1st solution (don't change user-input characters into HTML entities) wouldn't affect the ability to say 'amp' after an ampersand. That only happens with the 2nd option of preserving round-trip integrity. I preferred the 1st, which leaves the user input unchanged, so there would be no need to change it a 2nd time, which is what caused the problem you mentioned.

    I used a literal pi, which was changed into the #960 form in the edit buffer, but then wasn't changed back on display. FWIW, all of the different renderings of pi you tried display as pi on my system. Maybe it's a matter of browser configuration? I have set my browser's fallback character encoding, for 'legacy content' [sic] that fails to specify a character encoding, to UTF-8. It rarely fails, which suggests that even new pages that fail to specify an encoding are usually UTF-8. I'd say <10% actually use western as a default...

      "FWIW, all of the different renderings of pi you tried display as pi on my system."

      That's good (and what I also get). All four

      This pi (π) ... does display correctly.
      

      lines are in a 'pre' block and were intended to contrast with your earlier

      This pi (&#960;) doesn't display correctly.

      in a 'code' block.

      — Ken

      Update: Corrected spelling and capitalization mistakes.

      As best I can tell, without use utf8; in your Perl5 program, the Perl5 compiler expects the source code to be 8-bit ANSI characters.[1] With use utf8; in effect, you may have UTF-8 encoded characters in your source code.

      Quoted strings, by default, are treated as streams of 8-bit bytes. With use feature 'unicode_strings'; in effect, you can include UTF-8 encoded characters in quoted strings.

      If PM could store the characters/bytes within code tags as-is, then only apply HTML encoding when generating HTML output, I think that would achieve the desired result. (The download link could supply the "raw" bytes with Content-type: application/octet-stream.)

      If that can't be done, maybe instead of HTML encoding, do \x encoding. Either way, non-7-bit-ANSI source code gets messed up, but at least double quoted strings might still be correctly interpreted by the Perl compiler.[2]
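A rough sketch of the "store as-is, encode only on output" idea, under the assumption that the stored text is already a decoded character string. `html_escape_on_output` is a made-up name for illustration, not anything PM provides.

```perl
use strict;
use warnings;

# Hypothetical output filter: the stored text is never modified; markup
# characters and anything outside 7-bit ASCII become numeric character
# references only at the moment HTML is generated.
sub html_escape_on_output {
    my ($raw) = @_;
    $raw =~ s/([&<>"]|[^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $raw;
}

# Stored form: a literal pi (U+03C0) and the literal text \&amp;
my $stored = "pi is \x{3c0}; my \$coderef = \\&amp;";

print html_escape_on_output($stored), "\n";
```

Because every ampersand in the stored text is itself escaped, a browser decoding the output should show exactly what was typed, and a download link could hand out `$stored` untouched.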

      ---

      [1] I haven't tried using characters in the range 0x80 .. 0xFF in identifiers in Perl5, but Perl5 keywords all use characters < 0x80.

      [2] The open question is, when use feature 'unicode_strings'; is in effect, would "\x80\x77" be interpreted as 2 characters ("\x80" "\x77") or 1 ("\N{U+8077}") ?

        > Quoted strings, by default, are treated as streams of 8-bit bytes. With use feature 'unicode_strings'; in effect, you can include UTF-8 encoded characters in quoted strings.

        That's not exactly how it works. Under use utf8;, quoted strings can already contain Unicode characters:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use utf8;

        my $string = 'Pérl Mönx';
        say length $string;    # 9, not 11

        See The "Unicode Bug" for details.
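Where the "9, not 11" comes from can be checked without depending on how the source file is saved. This is only a sketch: it spells the two accented letters with \x{} escapes (my substitution, so that use utf8; is not needed) and uses the core Encode module to get the UTF-8 byte form.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same text as 'Pérl Mönx', written with escapes for e-acute (U+00E9)
# and o-umlaut (U+00F6).
my $string = "P\x{e9}rl M\x{f6}nx";

# The UTF-8 encoding needs two bytes for each accented letter.
my $bytes = encode('UTF-8', $string);

printf "%d characters, %d bytes\n", length($string), length($bytes);
# prints "9 characters, 11 bytes"
```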

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        the Perl5 compiler expects the source code to be 8 bit ANSI characters.[1]
        ---- There is no such thing as 8-bit ANSI. ANSI (i.e. ASCII) is only 7 bits. Perhaps you mean the western-euro-centric ISO-8859-1 character set, which is 8-bit? The first 256 Unicode code points are the same as the characters in ISO-8859-1; however, in the UTF-8 encoding of Unicode, each of the upper 128 takes 2 bytes to express (a lead byte of 0xC2 or 0xC3 followed by a continuation byte).
        2 The open question is, when use feature 'unicode_strings'; is in effect, would "\x80\x77" be interpreted as 2 characters ("\x80" "\x77") or 1 ("\N{U+8077}") ?
        They'll be treated as 2 characters, all the time. On output, however, the \x80 will generate a warning as perl converts it to UTF-8 (encoded as \xC2\x80). The \x77 remains \x77 because it is not above \x7f.
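A quick way to check the character count and the UTF-8 byte form for yourself; this sketch only covers those two points and uses the core Encode module for an explicit encode rather than relying on any output layer.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(encode);

my $string = "\x80\x77";

# Explicitly encode to UTF-8 and show the resulting bytes in hex.
my $utf8 = encode('UTF-8', $string);
my $hex  = join ' ', map { sprintf('%02X', ord($_)) } split //, $utf8;

print length($string), " characters; UTF-8 bytes: $hex\n";
# prints "2 characters; UTF-8 bytes: C2 80 77"
```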

        If you want it to remain binary data on output, you must tell perl not to convert it. On input, perl assumes the byte values 0x80-0xFF refer to character identities that coincidentally have the same meaning as the Unicode character with the same value (U+0080 - U+00FF).

        That's the Perl-Unicode bug. Perl is not round-trip safe by default. If you wanted it in binary (as indicated by the fact that it's not encoded properly for UTF-8, but is for ISO-8859), perl will still convert it to UTF-8 for you on output and generate a run-time warning about "wide characters" in output.

        If you meant for it to be valid UTF8 encoded Unicode, you would have encoded it as such (with the 0xC2 in front of each character over 0x7F). However, if you do, you will still get an error as perl treats valid (but not labeled) UTF-8 on input as *BINARY*, and your code will see 2 values for each single Unicode character -- 0xC2 and the 2nd character.

        So if you don't label your input, and you use character values >0x7F and <0x100, you will get wrong behavior out of your program -- either on output (if you intended binary), or on input if you encoded using UTF-8.

        More than one application uses a heuristic to avoid maximum harm to the user, i.e. if 0xC2 is detected before a byte in the range 0x7F < CHAR <= 0xFF, then assume input is UTF-8, else if CHAR > 0x7F, assume binary was intended. It isn't perfect, as 0xC2 followed by another character in the >0x7F zone can occur in binary data, but it is statistically unlikely, and good enough for most users who are oblivious to the need to label their I/O streams.
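A minimal sketch of that heuristic as described above, not taken from any particular application. `guess_stream_type` and its three return values are names I made up, and it only looks for the 0xC2 lead byte, so UTF-8 text using other lead bytes (0xC3 and up) would be reported as binary.

```perl
use strict;
use warnings;

# Hypothetical classifier for a chunk of unlabeled input bytes.
sub guess_stream_type {
    my ($bytes) = @_;

    # 0xC2 directly followed by a high byte: assume UTF-8 was intended.
    return 'utf8' if $bytes =~ /\xC2[\x80-\xFF]/;

    # Any other byte above 0x7F: assume binary (or Latin-1) was intended.
    return 'binary' if $bytes =~ /[\x80-\xFF]/;

    # Nothing above 0x7F: plain 7-bit text, no decision needed.
    return 'ascii';
}

for my $sample ("\xC2\xA9 2016", "caf\xE9", "plain text") {
    printf "%-6s <- %d bytes\n", guess_stream_type($sample), length $sample;
}
```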

        Of course you would only engage such heuristics when using stream I/O on STDIN/OUT/ERR. Files opened with "open" would always be interpreted as binary unless specified otherwise.

        However, due to some zeal to go Unicode in 5.8.0, all files got interpreted as UTF-8 if your locale specified UTF-8 to be used for encoding. That caused a kneejerk reaction to revert to the "Perl-Unicode" bug to cause errors & warnings where stream and/or file labeling wasn't used.

        The perl situation is completely different than the HTML problem -- in that HTML5 already specifies the default character set as UTF-8, while older sites using HTML4 may still be interpreted as ISO-8859, even though Unicode has been out for over 20 years. Sigh...

      In regards to the rendering problem ... my 1st solution -- don't change user-input characters into HTML entities, wouldn't affect the ability to say 'amp' after an ampersand.

      That's done by your browser. See Re: Strange letters ...

Re^2: BUG: code blocks don't retain literal formatting -- could they?
by $h4X4_|=73}{ (Monk) on Sep 17, 2016 at 12:05 UTC

    The reason why things are the way they are, is to allow code like this:

    sub amp { ... } my $coderef = \&amp;
    Unfortunately, your request would render that as:
    sub amp { ... } my $coderef = \&


    This bug is caused by not encoding both the semicolon and the ampersand of the HTML entity. The ampersand and semicolon must be encoded in the same pass, so that the filter cannot confuse them with an HTML entity produced by an earlier conversion, and that pass must be the first HTML filter.
    Any filter for HTML code that does not encode the ampersand and semicolon will have this problem.

    This problem was addressed in a past version of my module AUBBC v4.01 - 11/08/2010
    New version located at AUBBC2

    The fix I use now looks like this.
    s[(&|;)][$1 eq '&' ? '&amp;' : '&#59;']gex;
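Applied to the example line from earlier in the thread, that substitution behaves as sketched here. `encode_amp_and_semicolon` is just a wrapper name I chose for illustration; the s[][] line inside it is the one quoted above.

```perl
use strict;
use warnings;

# Wrapper around the single-pass substitution: every '&' and every ';'
# is encoded in one pass, so text produced by the replacement itself is
# never scanned again.
sub encode_amp_and_semicolon {
    my ($text) = @_;
    $text =~ s[(&|;)][$1 eq '&' ? '&amp;' : '&#59;']gex;
    return $text;
}

my $typed = 'my $coderef = \&amp;';
print encode_amp_and_semicolon($typed), "\n";
# prints: my $coderef = \&amp;amp&#59;
```

A browser decoding that output once should display the line as typed, including the trailing `amp;`.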

    Update: spelling

    Update: My bad! Because PerlMonks mixes HTML entities with HTML names, this will always cause a problem somewhere and no single filter will work in every case. You have to type the HTML name or you're S.O.L.
    Hey, welcome to PerlMonks. The place where you need to learn HTML before you can post your Perl question. ッ

      > The place where you need to learn HTML before you can post your Perl question

      Yeah, because knowing HTML is something absolutely pointless, while the knowledge of Markdown, or at least one of its dialects used at Stack Overflow, is something most employers need badly.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        If PM were going to implement an alternate "mark up", I would think the natural choice would be POD (since PM is about Perl).

      Why would mixing HTML named entities, hex, numeric, and whatever is a legal char in the document’s charset cause any problems?

      Hey, welcome to PerlMonks. The place where you need to learn HTML before you can post your Perl question.

      PM is not the only (still existing) website that uses HTML for posting. See http://slashdot.org for example.
