in reply to Re^9: somethign wrong with the sumbit
in thread somethign wrong with the sumbit

Hello Graff, first of all I'd like to wish a HAPPY NEW YEAR 2008 TO YOU AND YOUR FAMILY AND ALL THE BEST AT A PERSONAL LEVEL! (I wish the same to all the other members too.)

Thank you very much for the detailed explanation of how a byte sequence can be interpreted as a string, or as 1 or 2 numbers, depending on the charset the reader uses. So before I used Encode::from_to($_, 'ISO-8859-7', 'utf-8') on @display_files, I had to know what encoding the filesystem (XP) used to store filenames. I still have no idea what it is, though.

They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?"
Does that mean that if I assume the fs used the 'greek-iso' encoding, convert the names to 'utf8', and they display properly on the web page, I was correct to assume the filenames were stored as 'greek-iso'?

Is there a chance for the filenames to be in another encoding, be converted to 'utf8', and still appear properly on screen?

I tried to run the test script you gave me to see if I understood it, but I'm getting "Premature end of script headers: testmenu.pl".

I have read a post of yours many times and asked for more explanation on it (http://perlmonks.org/?parent=659662;node_id=3333). Please see the latest questions there and try to answer them too if you want.

I'm still confused about the nature of the encoding issue. I have 2 versions in mind:

a) Am I right to believe that this whole encoding issue is due to a false assumption that the filenames were encoded as greek-iso, and because they weren't (I guess; I still don't know whether they are greek-iso or not) they were wrongly converted to 'utf8', which is why they don't match after the form submission? But then how come they were displayed correctly on the web page if the source encoding wasn't greek-iso in the first place?

b) Or is this whole encoding issue due to the fact that, although the filenames were correctly converted to utf8, the client browser's internal form submission function took that string, somehow altered it (God knows how) and returned to index.pl a string consisting of the same chars but in a different encoding?

I'm really striving to understand what's going on here... this encoding issue has been giving me trouble for years, and before I correct it I need to understand it.

Re^11: somethign wrong with the sumbit
by graff (Chancellor) on Jan 01, 2008 at 21:24 UTC
    This is one of the longest/deepest dialogs I've had at PM, but you seem to be making progress, so I'm glad. (This last reply of yours does not give any evidence that you tried running the simple cgi script that I posted above, so you have made less progress than I would have liked, but I'll get over it.)

    ...if i assume the fs used 'greek-iso' encoding and i convert it to 'utf8' and is being display properly in the web page, was i correct to initially assumed the filenames were stored as 'greek-iso' ?

    That is called "the scientific method", also known as "the empirical approach", and "programming by experiment". The code is written according to a "hypothesis" about the data, and the results of running the code give you the evidence you need to decide whether the hypothesis was correct. You got that -- bravo!

    b) Is this whole encoding issue is due to the fact that although the filenames correctly converted to utf8, a client's browser internal form submission function took that string, somehow alter it(God knows how) and returned to index.pl a string consisted of the same chars but different encoding?

    It's actually simpler than that. The problem hinges on a false assumption that you made at this point in the current thread -- what you said was: "i beleive there is no need to explicitly tell perl to handle param('select') as utf8 it must do this by default i think." (**sigh**) That sort of assumption is worthless until you test it, or find trustworthy documentation that supports or contradicts it. My test cgi script proves that this assumption you made is wrong.

    So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.
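    A minimal sketch of that point, runnable from the shell without a web server. The file name and the variable names are made up for illustration, and the sketch assumes the script holds the original file name as a Perl character string; in a real CGI script the raw bytes would come from param('select') rather than from encode().

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical file name held as a Perl character string (Greek "nikos" + ".txt").
my $file_name = "\x{3BD}\x{3AF}\x{3BA}\x{3BF}\x{3C2}.txt";

# Stand-in for what the form submission delivers: the same name as raw UTF-8 bytes.
# (Assumption: in a real script this value would come from param('select').)
my $raw_param = encode('UTF-8', $file_name);

# Without decoding, perl compares raw bytes against characters: no match.
my $match_before = ($raw_param eq $file_name) ? 1 : 0;

# Explicitly tell perl that the bytes are UTF-8, then compare characters.
my $decoded_param = decode('UTF-8', $raw_param);
my $match_after = ($decoded_param eq $file_name) ? 1 : 0;

print "match before decode: $match_before\n";
print "match after decode:  $match_after\n";
```

    The only line that changes the outcome is the decode() call; everything else is scaffolding for the comparison.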

      Hello Graff, thank you for going on our longest/deepest journey :)

      As I stated in my previous post, I tried to run the test cgi script you gave me to see if I understood it, but I'm getting "Premature end of script headers: testmenu.pl" and hence I can't run it.

      Is there a chance for a string to be in some encoding, then be converted to 'utf8', and still appear properly on screen?
      I'm sure it can't, because I tested with the filenames and tried to convert them from 'cp1253' to 'utf-8', but when I tried to display them on the web page they looked like Chinese :)

      I would like to ask you, though, what exactly this phrase means: "the string 'nikos' is encoded in 'greek-iso'" or "the string 'nikos' is encoded in 'utf-8'". I mean, what does it literally mean?
      Does it signify the practical way this string is going to be stored on the hdd or in memory in terms of bytes?
      For example, will the char 'n' in greek-iso be 10010000 while the char 'n' in utf-8 is stored as 10101010 01011010?
      Which means we are using different bytes in each case for storing?

      Does this explain my question above, of why the filenames were never converted correctly (and hence never displayed correctly) when I tried to read them as cp1253 and convert them to utf8? Because I tried to read them in a different manner than the one the filesystem used to store them on the hdd?

      So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.
      I finally understood your point. The reason I fell for that wrong assumption was that the statement print header( -charset=>'utf-8' ); led me to believe that perl would handle any incoming/outgoing variable as 'utf-8' chars. But I am guessing this statement only tells the client's browser what encoding to use when displaying stuff and NOT what encoding to use when sending stuff over.

      But if the client's browser isn't smart enough to send form strings back to the sending script using the same encoding as the sending script does, HOW am I supposed to add a line to my index.pl telling perl to take the unknown encoded string and convert it to 'utf-8'?

      my $select = param('select');
      Encode::from_to($select, 'unknown_enc', 'utf-8');
      I mean, I don't know the source encoding the browser used to send the parameter string back to me. How will I detect it? More trouble for the programmer: apart from dealing with different encodings, now he has to detect them too.
      It would be so much easier if the browser used the same encoding scheme to return the string as the script that sent the parameter to the browser, and of course if perl knew that by default.
        i tried to run the test cgi script you gave me so to see if i understood it but i'm getting a Premature end of script headers: testmenu.pl and hence i cant run it.

        I must apologize -- I should have seen that. But you seem to be saying "I haven't tried to figure out why it doesn't run or why I get this error message". The first thing you could try, after downloading the code, is this command line (in a shell window):

        perl -T -cw testmenu.pl
        If there were a syntax error in the script, this would tell you what lines in the script have problems. If there are no errors, you can just run it as a shell command, and it should print out valid HTML to the shell window.

        The "premature end of headers" message just means that the script exited before it got very far in writing HTML data to the browser. Maybe the web server doesn't think the file is executable? If you have a working cgi script, does the file name for that script end in ".pl" or something else? (Often, a web server is configured so that it will only execute a script if it has a specific filename extension; ".pl" might not be right for your server, and you may need to rename testmenu.pl to something else.)

        I can assure you that it runs correctly for me (perl 5.8.8 on macosx, running an apache2 web server), and I don't think there's anything in the script that depends on the particular OS, web server, browser or perl version (so long as it's perl 5.8.0 or later). Please keep trying to see if you can make it work on your machine. In any case, the important thing is the "decode()" line. Use that in your own script, and see what happens.

        ... the char 'n' in greek-iso will be 10010000 while the char 'n' in utf-8 will be stored as 10101010 01011010 ? Which means we are using different bytes in each case for storing?

        First off, let's use hex digits, ok? It's just easier. I'm sorry, but 10010000 is 0x90 in hex, and that is not "n" -- in 8859-7 it falls in the C1 control-code range and is not a printable character at all (the nearby 0xB0, binary 10110000, is the "DEGREE SIGN" character, a small raised circle). The Greek "CAPITAL LETTER NU" in 8859-7 is 0xCD and "SMALL LETTER NU" is 0xED (likewise in cp1253). Those single-byte (non-unicode) codepoints represent the same letters as the unicode codepoints 039D and 03BD, respectively. You can look those up at www.unicode.org.

        The Unicode Standard, in its divine wisdom, provides several ways of storing those 16-bit codepoints -- here are the various byte sequences for those two unicode characters, depending on which encoding you choose:

        UTF-16LE: 9D 03 and BD 03
        UTF-16BE: 03 9D and 03 BD
        utf8:     CE 9D and CE BD
        (Note that the binary numbers you gave for "greek-iso 'n' in utf8" were also wrong. You must have been making them up.)

        Each of those byte pairs, when interpreted correctly, is linguistically equivalent to the single-byte characters 0xCD and 0xED used in 8859-7 and cp1253 (so obviously unicode uses more bytes per Greek character than the non-unicode encodings). That relationship between different byte values for the "same letter" is what character encoding conversion (the Encode module) is all about.
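        If you want to check those byte values yourself, here is a small sketch using the Encode module. The helper name hex_bytes is made up for this example; the encoding names are ones that Encode recognizes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes of $char in $encoding, as space-separated hex.
sub hex_bytes {
    my ($encoding, $char) = @_;
    my $bytes = encode($encoding, $char);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# GREEK CAPITAL LETTER NU and GREEK SMALL LETTER NU as character strings.
my %nu = (capital => "\x{39D}", small => "\x{3BD}");

for my $enc (qw(iso-8859-7 cp1253 UTF-16LE UTF-16BE UTF-8)) {
    printf "%-10s capital: %-5s small: %s\n",
        $enc, hex_bytes($enc, $nu{capital}), hex_bytes($enc, $nu{small});
}
```

        The same two characters go in every time; only the bytes that come out differ from one encoding to the next.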

        If you are handling character data that you know is Greek, and it ends up looking like Chinese when you display it, this means that the byte stream is being misinterpreted. As you should know by now, there are lots of ways to interpret the bytes incorrectly, and only one correct interpretation.
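        To illustrate, here is a sketch that takes the two utf8 bytes for "SMALL LETTER NU" and decodes them three different ways. The variable and helper names are invented for the example. Only the first interpretation gives back the Greek letter; the 8859-7 reading yields two unrelated characters, and the UTF-16BE reading lands in the Hangul block, which is roughly the "looks like Chinese" effect.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 byte sequence for GREEK SMALL LETTER NU (bytes CE BD).
my $nu_utf8_bytes = encode('UTF-8', "\x{3BD}");

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

my %seen;
for my $enc (qw(UTF-8 iso-8859-7 UTF-16BE)) {
    my $copy = $nu_utf8_bytes;    # work on a copy of the byte string
    $seen{$enc} = code_points(decode($enc, $copy));
    printf "%-10s -> %s\n", $enc, $seen{$enc};
}
```

        The bytes never change in this sketch; only the claim about what they mean does.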

        But if the client's browser aint smart enough to send form strings back to the sending script using the same encoding as the sending script does, HOW am i supposed to add a line to my index.pl telling perl to take the unknown encoded string and convert it to 'utf-8'?

        It's up to the person using the browser to make sure that the browser is using the correct character encoding in order to display the page you send. You control the character set being used, so the browser has to conform to your usage.

        In any case, the form data sent back to your server from the browser is determined by you when you create the form. Assuming the browser user is being cooperative, you will get back the byte sequences that you provided in the form that you sent out. (Of course, non-cooperative users will try to spoof you by sending requests with strings that you never put into your forms; that is what taint checking is all about).

        In other words, when you send a form to a browser, and the user clicks things on the form and submits it, the values sent back are exactly the parameter values that you provided in the form -- the browser is not supposed to do anything to change those values (not even anything like changing the character encoding); it just provides a way for the user to make selections, and it sends back the information you requested about those selections.
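        Since the values coming back should be exactly the ones you sent out, one way to handle them (a sketch only, with made-up file names and a hypothetical accept_selection helper) is to decode the submitted bytes as utf8 and accept the result only if it is one of the values you put into the form. Anything else, including bytes that are not valid utf8, is rejected.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical file names offered in the form, held as character strings.
my @offered = ("\x{3BD}\x{3AF}\x{3BA}\x{3BF}\x{3C2}.txt", "notes.txt");
my %is_offered = map { $_ => 1 } @offered;

# Decode a submitted value and accept it only if it is one we sent out.
sub accept_selection {
    my ($raw_bytes) = @_;
    my $chosen = eval {
        decode('UTF-8', $raw_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return undef unless defined $chosen;    # not valid UTF-8 at all
    return $is_offered{$chosen} ? $chosen : undef;
}

my $good    = accept_selection(encode('UTF-8', $offered[0]));  # cooperative browser
my $spoofed = accept_selection("../etc/passwd");               # never offered
my $garbled = accept_selection("\xCD");                        # 8859-7 byte, invalid UTF-8

print "good:    ", (defined $good    ? "accepted" : "rejected"), "\n";
print "spoofed: ", (defined $spoofed ? "accepted" : "rejected"), "\n";
print "garbled: ", (defined $garbled ? "accepted" : "rejected"), "\n";
```

        This does not replace taint checking; it only shows that matching against the list you sent out is straightforward once the parameter has been decoded.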