PerlMonks  

Re^8: BUG: code blocks don't retain literal formatting -- could they?

by perl-diddler (Chaplain)
on Sep 20, 2016 at 08:33 UTC ( [id://1172206]=note )


in reply to Re^7: BUG: code blocks don't retain literal formatting -- could they?
in thread BUG: code blocks don't retain literal formatting -- could they?

I'm pretty sure that whatever PM sends (properly UTF8-encoded responses or whatever) has no effect on what the Content-type header has for its charset and encoding attributes. A website can set the Content-type charset and encoding attribs to whatever it likes; that is independent of what it sends in the content stream. I.e., doing one doesn't force the other. They can be done independently -- however, having them in agreement might be less confusing to some browsers.

Likely what is so is that those who are interested in UTF8 set their browsers to assume that encoding for pages that don't declare one, since many HTML4 websites that don't declare an encoding still use UTF8 -- whether by intention or by users typing in UTF8 strings that later get displayed to others. I.e. when we use UTF8, most of us already see it properly as UTF8 chars in our browsers. What is at issue is that the code blocks convert such things into HTML entities when the site scans our input, but it doesn't convert them back on output because they are in code blocks.

The bug is that they are converted into HTML-entities in the first place.

Too bad no one is interested in fixing this. I guess they went AWOL... ;-)


Replies are listed 'Best First'.
Re^9: BUG: code blocks don't retain literal formatting -- could they?
by choroba (Cardinal) on Sep 20, 2016 at 10:57 UTC
    > What is at issue is that the code blocks convert such things into HTML entities when it scans our input into the site, but it doesn't convert them on output because they are in code blocks.

    No, that's not what happens. Higher Unicode characters are always converted to entities, not only in code blocks. The problem is that they're displayed correctly outside of code blocks on output, but incorrectly inside, because PerlMonks doesn't parse the nodes and doesn't render code blocks differently to other parts. That makes the fix complicated: the site would have to start parsing the content of the nodes.
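A minimal sketch of the symptom described above, in Python purely for illustration. It assumes (this is not taken from the PM source) that prose is passed through to the page unchanged while code-block content has its markup characters escaped on output. `render_prose`, `render_code` and `browser_display` are hypothetical names, and `html.unescape` stands in for the browser's entity decoding.

```python
import html

# What the server is assumed to hold after the character arrived as an entity.
stored = "&#960;"

def render_prose(text):
    # Assumption: prose is sent to the page as stored.
    return text

def render_code(text):
    # Assumption: code-block content is escaped so that it shows up literally.
    return html.escape(text, quote=False)

def browser_display(markup):
    # Stand-in for the browser turning entities back into characters.
    return html.unescape(markup)

prose_view = browser_display(render_prose(stored))  # the character pi
code_view = browser_display(render_code(stored))    # the literal text "&#960;"
```

Under those assumptions the same stored text reads as a character in prose and as a literal entity in a code block, which matches the symptom in this thread.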

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^9: BUG: code blocks don't retain literal formatting -- could they? (browser)
by tye (Sage) on Sep 21, 2016 at 01:21 UTC

    The bug is that they are converted into HTML-entities in the first place.

    Too bad no one is interested in fixing this. I guess they went AWOL... ;-)

    Perhaps you should switch to a browser that still has some people doing development work on it? Yes, what is generating HTML entities is your browser, as I explained in Re: Unicode characters in <code> blocks (browser). (Update: And you were already told this in this thread.)

    The rest of what you wrote above is almost scary in its encouraging unwise and fragile practices.

    - tye        

      Hmmm....I misread what you said... But your reference to the "Unicode characters in <code> blocks (browser)" shows this has been an issue for 10 years and nothing has been done. If trying experiments and backing them out is your idea of "scary", perhaps that's why no one does anything. FWIW, I have feelings about not wanting to change stuff that works and am sometimes not wanting to deal with the stuff that might break if I change something -- but I realize that if I don't try, nothing will ever get done -- I have to be willing to mess things up and *fix them*, in order to learn, grow and move forward. Certainly I mess up often, but I usually am forced to eat my own doo doo, so I have to fix things.

      I don't run a website, but do build lots of my own tools (but not close to a majority). I do build my own kernel, customized for my hw. And I try to build various tools in my tool chain (including perl). If I mess things up too badly on my server, I'm offline, as it's my web-gate (email host, DNS, web-proxy) ... and at times I get "really: 'I so don't want to be doing this on a weekend at 3am in the morning---waaaa!'".

      As far as the browser encoding HTML entities goes, the comment was unclear about which direction was meant. I wasn't aware my browser was mangling my output before it got to you (which is what you seem to be saying).

      It sounds like what you are saying is that we need a way to tell PM that some characters in HTML-entity form should be re-encoded into chars (vs. those cases where someone explicitly wants raw HTML entities in their code examples). How many people do that? I realize it might come up when someone asks about some HTML module in perl, but for the most part how common is that?

      Maybe it would be acceptable to not translate HTML related entities for those chars needed in HTML syntax (like greater+less than, ampersand... etc), but do translate those that are >0x7f (as they aren't used as being part of HTML syntax).
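A rough sketch of that selective translation, in Python purely for illustration. It decodes only numeric character references above 0x7f and leaves everything else alone, including named entities such as `&lt;` and `&amp;` and numeric references in the ASCII range. The function name `decode_high_entities` is hypothetical and nothing here reflects how PM actually stores or renders nodes.

```python
import re

# Matches decimal (&#960;) and hexadecimal (&#x3C0;) numeric character references.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def decode_high_entities(text):
    """Turn numeric entities above 0x7f into characters; leave the rest as-is."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code <= 0xDFFF
        if 0x7F < code <= 0x10FFFF and not is_surrogate:
            return chr(code)
        return match.group(0)  # ASCII-range or invalid: keep the entity text
    return _NUMERIC_ENTITY.sub(repl, text)

sample = "my $pi = '&#960;'; # &lt;tag&gt; &amp; &#60;"
decoded = decode_high_entities(sample)
```

This cannot tell a character that became an entity on submission from an entity someone typed on purpose, since both look identical in the stored text, so the question above about how often people want literal entities still matters.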

Re^9: BUG: code blocks don't retain literal formatting -- could they?
by hippo (Bishop) on Sep 20, 2016 at 09:50 UTC
    Too bad no one is interested in fixing this.

    Good news, everyone! Looks like we have a volunteer.

      Where do I get access and how do I update the existing code?

      I'm more than happy to look at the problem and see if I'm able to do anything about it.

Re^9: BUG: code blocks don't retain literal formatting -- could they?
by Your Mother (Archbishop) on Sep 20, 2016 at 20:59 UTC
    I guess they went AWOL

    While you may be summarily executed for this in some realms, you don’t actually need anyone’s permission to leave in open source.

    This unheard of freedom in human interaction is part of why it attracts so many bright persons to do so much hard work for so little financial gain. Never slight it, not even implicitly.

Re^9: BUG: code blocks don't retain literal formatting -- could they?
by RonW (Parson) on Sep 20, 2016 at 18:59 UTC
    Too bad no one is interested in fixing this. I guess they went AWOL

    PM's code base dates back to the 1990s. http://everything2.com/title/Everything+Engine

    Granted, some of the issues with code tag processing could have been dealt with in the early days of PM, however, the limitations of the windows-1252 character set did not become a problem until years later.

    Unfortunately, getting the PM website to handle Unicode/UTF8 is much more complicated than adding use feature 'unicode_strings'; statements to the code.

      You'll forgive me for not taking someone else's word for it. That's not to say you don't know far more than I do about how difficult it is, but until I've looked at the issue and seen that it's not worth the effort, I am a dyed-in-the-wool skeptic.

      While the browser can likely convert html entities to binary-streams, I am pretty sure the opposite doesn't happen. Case in point -- here. Why would the browser, browsing a site that identifies itself as windows-1252 interpret user characters as Unicode and convert them into HTML-entities representing the unicode characters?

      Second issue on that -- I've never seen any of my browsers do that on any other site. Though they can convert the entities into a binary stream. But again -- why would the browser convert the HTML entities into UTF-8 encoded Unicode if the website's encoding was directing conversion?

      My claim is that for entities above the ASCII range, those entities will be converted into UTF-8 to be displayed in the browser. Case in point -- pi. Its character code is not in windows-1252. The browser converts the entity to UTF-8 -- not windows-1252, which is why I believe the fix is relatively trivial.

        until I've looked at the issue and seen that it's not worth the effort, I am a dyed in the wool skeptic.

        Understandable. I don't know how to get invited to pmdev, but maybe looking at the underlying engine will give you some insight. Do note that the engine is only a "foundation". A lot of the code that actually runs PM is contained in nodes (See Finding the code).

        Why would the browser, browsing a site that identifies itself as windows-1252 interpret user characters as Unicode and convert them into HTML-entities representing the unicode characters?

        If a character can't be represented in windows-1252 (or whatever character set the server says it's using), then an HTML entity is the representation called for in the W3 specifications. At the very least, the server can store the entity as part of the user supplied text.
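To make that concrete, here is a small sketch in Python, purely for illustration. Python's `xmlcharrefreplace` error handler is used as a stand-in for what a browser does when text has to be submitted in windows-1252: anything the character set cannot hold is replaced with a numeric character reference. The sample string is made up.

```python
# Text as the user typed it: pi is not in windows-1252, e-acute is.
typed = "my $pi = 'π'; # café"

# Encode for a windows-1252 page, replacing unrepresentable characters
# with numeric character references instead of failing.
submitted = typed.encode("cp1252", errors="xmlcharrefreplace")
# submitted == b"my $pi = '&#960;'; # caf\xe9"
```

The e-acute goes through as the single byte 0xE9, while pi arrives as the seven ASCII bytes of `&#960;`, which the server can store as ordinary text.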

        why would the browser convert the HTML entities into UTF-8 encoded Unicode if the website's encoding was directing conversion?

        The website isn't directing conversion. It's only telling the browser what it is sending. If the website tells the browser to expect windows-1252 characters, the browser will perform whatever conversion it needs to be able to display windows-1252 characters. If the server needs to send a character that isn't represented in windows-1252, it has to use an HTML-entity. It expects the browser to know what to do with the entity.

        If the server tells the browser to expect Unicode characters, then the only entities it would need to send would be for those characters that are also part of HTML mark up (so the browser knows those aren't part of the HTML mark up).
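A sketch of the server side of that, in Python purely for illustration. The helper `encode_for_charset` is hypothetical: it escapes the markup characters first and then encodes for the declared charset, falling back to numeric entities only for characters the charset cannot represent.

```python
import html

def encode_for_charset(text, charset):
    """Escape markup characters, then encode for the declared charset."""
    escaped = html.escape(text, quote=False)  # & < > become named entities
    # Characters missing from the charset become numeric character references.
    return escaped.encode(charset, errors="xmlcharrefreplace")

as_cp1252 = encode_for_charset("a < π", "cp1252")  # b"a &lt; &#960;"
as_utf8 = encode_for_charset("a < π", "utf-8")     # b"a &lt; \xcf\x80"
```

With windows-1252 declared, pi has to travel as an entity. With UTF-8 declared, only the markup character is sent as an entity and pi goes out as two raw bytes.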

        Side note: A large percentage of software changes I thought would be trivial, weren't. Most of this was because of new things that the original designers had not even dreamed of, let alone thought of. PM is very old. But keeps on working. If it were ever moved to a newer system, transferring the content might not be practical.

Re^9: BUG: code blocks don't retain literal formatting -- could they?
by RonW (Parson) on Sep 20, 2016 at 18:37 UTC

    I should have been more precise.

    The PM website uses "windows-1252".1

    The web browser will interpret the byte stream as windows-1252 characters. And even if UTF8 encoding were used, the character set is still windows-1252.2

    Therefore, simply not encoding non-ANSI characters (within code tags) into HTML entities would not work.
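A short sketch of why that would not work, in Python purely for illustration. It assumes the server sends raw UTF-8 bytes while the page is still declared as windows-1252, and uses `decode("cp1252")` as a stand-in for the browser's interpretation of the byte stream.

```python
# Raw UTF-8 bytes for pi, sent without entity encoding.
raw_bytes = "π".encode("utf-8")  # b"\xcf\x80"

# A browser told to expect windows-1252 reads each byte as one character.
seen_by_browser = raw_bytes.decode("cp1252")
# seen_by_browser == "Ï€", not "π"
```

The two bytes of the UTF-8 sequence show up as two unrelated windows-1252 characters, so the declared charset would have to change along with the encoding.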

    Update: Apparently, the HTML entity encoding takes place in the web browser: Re^3: Strange letters ... (clients)

    In theory, this encoding could be reversed, but would still be only a part of the problem.

    ---

    1"windows-1252" is a superset of ANSI that includes some characters needed for some Western European languages. (It is also a superset of ISO-8859-1 (aka "latin-1").)

    2"UTF8" encoding is not specific to Unicode. All it really is is a specification for encoding a 32 bit value in to a variable length string of bytes.
