in reply to Re^11: somethign wrong with the sumbit
in thread somethign wrong with the sumbit

Hello Graff, thank you for going on our longest/deepest journey :)

As I stated in my previous post, I tried to run the test cgi script you gave me to see if I understood it, but I'm getting a "Premature end of script headers: testmenu.pl" error and hence I can't run it.

Is there a chance for a string to be in some encoding, then be converted to 'utf8', and still appear properly on screen?
I'm sure it can't, because I tested with the filenames and tried to convert them from 'cp1253' to 'utf-8', but then when I tried to display them on a web page they looked like Chinese :)

Would like to ask you though what exactly this phrase means: The string "nikos" for example is encoded in 'greek-iso' or the string "nikos" is encoded in 'utf-8'. I mean what does it literally mean?
Does it signify the practical way this string is gonna be stored in the hdd/memory in terms of bytes?
For example the char 'n' in greek-iso will be 10010000 while the char 'n' in utf-8 will be stored as 10101010 01011010 ?
Which means we are using different bytes in each case for storing?

Does this explain my question above, of why the filenames never got converted (and hence displayed) correctly when I tried to read them as cp1253 and convert them to utf8? Is it because I tried to read them in a different manner from the one the filesystem used to store them on the hdd?

So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.
I finally understood your point, and the reason I fell for that wrong assumption was that the statement print header( -charset=>'utf-8' ); led me to believe that perl would handle any incoming/outgoing variable as 'utf-8' chars. But I am guessing this statement only tells the client's browser what encoding to use when displaying stuff and NOT what encoding to use when sending stuff over.

But if the client's browser aint smart enough to send form strings back to the sending script using the same encoding as the sending script does, HOW am i supposed to add a line to my index.pl telling perl to take the unknown encoded string and convert it to 'utf-8'?

my $value = param('select');
Encode::from_to($value, 'unknown_enc', 'utf-8');
I mean I don't know the source encoding the browser used to send me the parameter string back. How will I detect it? More trouble for the programmer, who apart from dealing with different encodings now has to detect them too.
It would be so much easier if the browser used the same encoding scheme to return the string as the script that sent the browser the parameter, and of course if perl knew that by default.

Re^13: somethign wrong with the sumbit
by graff (Chancellor) on Jan 02, 2008 at 06:03 UTC
    I tried to run the test cgi script you gave me to see if I understood it, but I'm getting a "Premature end of script headers: testmenu.pl" error and hence I can't run it.

    I must apologize -- I should have seen that. But you seem to be saying "I haven't tried to figure out why it doesn't run or why I get this error message". The first thing you could try, after downloading the code, is this command line (in a shell window):

    perl -T -cw testmenu.pl
    If there were a syntax error in the script, this would tell you what lines in the script have problems. If there are no errors, you can just run it as a shell command, and it should print out valid HTML to the shell window.

    The "premature end of headers" message just means that the script exited before it got very far in writing HTML data to the browser. Maybe the web server doesn't think the file is executable? If you have a working cgi script, does the file name for that script end in ".pl" or something else? (Often, a web server is configured so that it will only execute a script if it has a specific filename extension; ".pl" might not be right for your server, and you may need to rename testmenu.pl to something else.)

    I can assure you that it runs correctly for me (perl 5.8.8 on macosx, running an apache2 web server), and I don't think there's anything in the script that depends on the particular OS, web server, browser or perl version (so long as it's perl 5.8.0 or later). Please keep trying to see if you can make it work on your machine. In any case, the important thing is the "decode()" line. Use that in your own script, and see what happens.

    ... the char 'n' in greek-iso will be 10010000 while the char 'n' in utf-8 will be stored as 10101010 01011010 ? Which means we are using different bytes in each case for storing?

    First off, let's use hex digits, ok? It's just easier. I'm sorry but 0x90 (the hex equivalent for 10010000) is not "n" -- in 8859-7 that byte is a control code, not a letter. The Greek "CAPITAL LETTER NU" in 8859-7 is 0xCD and "SMALL LETTER NU" is 0xED (likewise in cp1253). Those single-byte (non-unicode) codepoints represent the same letters as the unicode codepoints 039D and 03BD, respectively. You can look those up at www.unicode.org.

    The Unicode Standard, in its divine wisdom, provides several ways of storing those 16-bit codepoints -- here are the various byte sequences for those two unicode characters, depending on which encoding you choose:

    UTF-16LE: 9D 03 and BD 03
    UTF-16BE: 03 9D and 03 BD
    utf8:     CE 9D and CE BD
    (Note that the binary numbers you gave for "greek-iso 'n' in utf8" were also wrong. You must have been making them up.)

    Each of those byte pairs, when interpreted correctly, is linguistically equivalent to the single byte characters 0xCD and 0xED used in 8859-7 and cp1253 (so obviously unicode uses more bytes per Greek character than the non-unicode encodings). That relationship between different byte values for the "same letter" is what character encoding conversion (the Encode module) is all about.
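The byte values listed above can be checked with the Encode module. The following is only a minimal sketch: the helper name hex_bytes is made up for illustration, and it assumes the encoding names iso-8859-7, cp1253, UTF-16LE, UTF-16BE and UTF-8 as Encode spells them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character string in the given encoding
# and return the resulting bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my @letters = (
    [ 'CAPITAL NU', "\x{039D}" ],
    [ 'SMALL NU',   "\x{03BD}" ],
);

# Print one line per letter and encoding.
for my $letter (@letters) {
    my ($name, $char) = @$letter;
    for my $enc (qw(iso-8859-7 cp1253 UTF-16LE UTF-16BE UTF-8)) {
        printf "%-10s %-10s %s\n", $name, $enc, hex_bytes($enc, $char);
    }
}
```

Each output line shows the same letter stored as different bytes, depending only on the encoding chosen.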

    If you are handling character data that you know is Greek, and it ends up looking like Chinese when you display it, this means that the byte stream is being misinterpreted. As you should know by now, there are lots of ways to interpret the bytes incorrectly, and only one correct interpretation.
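As a rough illustration of a misinterpreted byte stream, the sketch below decodes the same utf8 bytes once with the right encoding and once with the wrong one. The sample word and variable names are assumptions chosen for the example, not taken from the scripts discussed in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The Greek word "nikos" written as Unicode escapes (5 characters).
my $original = "\x{03BD}\x{03AF}\x{03BA}\x{03BF}\x{03C2}";

# Stored as utf8, every one of these letters takes two bytes.
my $bytes = encode('UTF-8', $original);

# Correct interpretation: the bytes are read back as utf8.
my $right = decode('UTF-8', $bytes);

# Incorrect interpretation: the same bytes are read as cp1253,
# so every single byte is taken to be a character of its own.
my $wrong = decode('cp1253', $bytes);

printf "bytes stored:       %d\n", length $bytes;
printf "decoded as utf8:    %d characters, %s\n",
    length $right, ($right eq $original ? 'same as original' : 'garbled');
printf "decoded as cp1253:  %d characters, %s\n",
    length $wrong, ($wrong eq $original ? 'same as original' : 'garbled');
```

The bytes never change between the two decode calls; only the interpretation does, and only one interpretation gives back the original word.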

    But if the client's browser aint smart enough to send form strings back to the sending script using the same encoding as the sending script does, HOW am i supposed to add a line to my index.pl telling perl to take the unknown encoded string and convert it to 'utf-8'?

    It's up to the person using the browser to make sure that the browser is using the correct character encoding in order to display the page you send. You control the character set being used, so the browser has to conform to your usage.

    In any case, the form data sent back to your server from the browser is determined by you when you create the form. Assuming the browser user is being cooperative, you will get back the byte sequences that you provided in the form that you sent out. (Of course, non-cooperative users will try to spoof you by sending requests with strings that you never put into your forms; that is what taint checking is all about).

    In other words, when you send a form to a browser, and the user clicks things on the form and submits it, the values sent back are exactly the parameter values that you provided in the form -- the browser is not supposed to do anything to change those values (not even anything like changing the character encoding); it just provides a way for the user to make selections, and it sends back the information you requested about those selections.
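A minimal sketch of why an unchanged value can still fail an eq test: it assumes the list of names is held as decoded character strings while the submitted value arrives as raw utf8 bytes. The helper param_matches and the variable $submitted (a stand-in for what param('select') would return) are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode the raw bytes of a form parameter as utf8
# and count how many of the known names equal the decoded string.
sub param_matches {
    my ($raw_bytes, @known_names) = @_;
    my $chars = decode('UTF-8', $raw_bytes);
    my $count = grep { $_ eq $chars } @known_names;
    return $count;
}

# Assumption: the names are character strings (already decoded).
my @display_files = ("\x{03BD}\x{03AF}\x{03BA}\x{03BF}\x{03C2}");

# Stand-in for param('select'): the same name as raw utf8 bytes.
my $submitted = encode('UTF-8', $display_files[0]);

my $raw_hits     = grep { $_ eq $submitted } @display_files;
my $decoded_hits = param_matches($submitted, @display_files);

print "comparing raw bytes:     ", ($raw_hits     ? 'match' : 'no match'), "\n";
print "comparing after decode:  ", ($decoded_hits ? 'match' : 'no match'), "\n";
```

Under these assumptions the browser has altered nothing; the comparison fails only until the bytes are decoded into characters.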

      Your test cgi script is syntactically correct and it runs ok from the shell, but I don't understand a thing. All I see is raw html data like the following, which doesn't make any sense to me. I can't make it work in the browser though and don't understand why; it's a mystery, because the script is correct:
      <option value="&#9580;&#959;&#9580;&#958;">&#9580;&#959;&#9580;&#958;</option>
      <option value="&#9580;&#9619;&#9575;&#914;">&#9580;&#9619;&#9575;&#914;</option>
      In any case, the form data sent back to your server from the browser is determined by you when you create the form.
      By that do you mean that when I create the form I must explicitly tell perl what character set to expect in the form string returned by the browser? Isn't print header (-charset => 'utf-8') supposed to do this, or do I need something extra too?

      OR perhaps by that you mean that I'm responsible for switching the returned form string to an encoding of my liking (utf-8, for example)?

      In other words, when you send a form to a browser, and the user clicks things on the form and submits it, the values sent back are exactly the parameter values that you provided in the form -- the browser is not supposed to do anything to change those values (not even anything like changing the character encoding); it just provides a way for the user to make selections, and it sends back the information you requested about those selections.
      If the browser were sending back to my index.pl script exactly the same string the user selected before submitting the form (not altering it in any way), then WHY aren't the strings matching when I do this? unless ( grep { $_ eq param('select') } @display_files ) ????
      Since the string before the user hits submit and the string returned after submission are NOT the same, doesn't that prove that the browser DOES ALTER the submitted data in some way, although it shouldn't? Isn't that against the rules?

      Because you said I needed a line, I also tried this:

      $article = decode('utf-8', param('select'));
      Encode::from_to($article, 'utf-8', 'ISO-8859-7');
      open FILE, "<$ENV{'DOCUMENT_ROOT'}/data/text/$article.txt" or die $!;
      local $/;
      $data = <FILE>;
      close FILE;
      The idea was to properly set the string returned by the browser to utf8 (I tried it like this to avoid specifying the source encoding, giving only the target one, but it doesn't work though) and then re-encode it to greek-iso, since that's the encoding needed to open the file later.
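For comparison, here is a hedged sketch of the two usual ways to turn utf8 bytes into ISO-8859-7 bytes with Encode. It assumes the submitted value really is utf8; the helper name utf8_to_greek_bytes and the sample value are made up for the example. Note that from_to works in place on a byte string, so it is applied to the raw bytes and not to a string that decode has already turned into characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Hypothetical helper: utf8 bytes in, ISO-8859-7 bytes out,
# going through a decoded character string in between.
sub utf8_to_greek_bytes {
    my ($raw) = @_;
    my $chars = decode('UTF-8', $raw);      # bytes -> characters
    return encode('iso-8859-7', $chars);    # characters -> bytes
}

# Sample value (an assumption): the utf8 bytes of the Greek word "nikos".
my $raw = "\xCE\xBD\xCE\xAF\xCE\xBA\xCE\xBF\xCF\x82";

my $greek = utf8_to_greek_bytes($raw);

# Alternative: from_to converts a copy of the raw bytes in place.
my $in_place = $raw;
from_to($in_place, 'UTF-8', 'iso-8859-7');

print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $greek), "\n";
print "both routes agree\n" if $greek eq $in_place;
```

Either route alone is enough; combining decode with a following from_to on the decoded string is what the two-step approach above avoids.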
        I think it's time for you to get someone to do this job for you -- that is, someone who can talk to you face-to-face (possibly in Greek), and can login on your machine, and can understand what needs to be done.

        I cannot and will not do the job for you, and you are presenting yourself as being too dense, too helpless, and too unable or unwilling to find things out on your own. It's almost bordering on trollishness, and I can't spend any more time trying to teach you things that you should be learning without help from me or other PerlMonks.

        Face it: you are not a programmer. If you need to get some programming work done, you will need to find someone who knows how to do it, and you will most likely need to pay them to do it. If someone else is actually paying you to try to get this done, I think you should, in all honesty, quit this job, tell the boss to hire someone else more qualified, and go off to find yourself another job that is better suited to your skills, whatever they may be.

        The same applies to your other current thread, which is almost as long as this one now, where people have been trying in vain to teach you things that you seem incapable of learning. Give it up.
