Re^9: somethign wrong with the sumbit

I assumed 'iso-8859-7' because the characters used in the filenames were Greek.

Programming based on assumptions is just like programming based on beliefs -- not good enough. When you lack reliable or coherent documentation, you need to do tests. They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?" So do the tests, and spend less time asking us to do them for you.

I quoted many other things that you said in my previous post, and asked about them in order to understand what's going on, but you didn't answer those questions, so I still don't have a clear picture of what went wrong and why the values returned from the form aren't matching any of the values in the array @display_files.

Did you try to run the little test script that I posted? Save that script as a file in your cgi-bin directory (call it "testmenu.pl" or something like that), make it executable, and point your browser at it. When you see that it works, try to set your own application so it handles the menu strings and parameter values the same way. If it does not work for you, try to be as explicit and clear as possible when you report what actually happens (error messages, web page content); if you made changes in the code before running it (though this should be unnecessary), show the code that you actually ran.

What do you mean by "can be read as if it were anything at all"?

Let's see if I can explain it better. Here's a sequence of four bytes, expressed as hex numbers:

ce  a6  ce  a5

If you treat those bytes as an ISO-8859-1 string, it's four characters, where the first and third are "capital-I-with-circumflex", the second is "broken-bar", and the fourth is "yen-sign". If you treat it as ISO-8859-7 (Greek), the "a6" byte is still the "broken-bar" char, but "a5" is the "drachma-sign" and "ce" is "Greek-capital-letter-Xi". If interpreted as utf8, it's just two Unicode Greek letters (capital-Phi / U+03A6, capital-Upsilon / U+03A5). If treated as UTF-16BE (big-endian), it's two other Unicode characters: U+CEA6 and U+CEA5 (Hangul syllables); treated as UTF-16LE, it's U+A6CE and U+A5CE, which are unassigned as of Unicode 5.0 (no characters exist at those code points in that version). Many other non-Unicode character sets could be used to get even more interpretations of the string as two or four characters.
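
Here is a minimal sketch of that point, assuming the core Encode module is installed (the helper name codepoints is made up for this illustration). It decodes the same four bytes under each encoding mentioned above and prints the resulting code points, so you can check the claims yourself rather than take them on faith.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes from the example above.
my $bytes = "\xce\xa6\xce\xa5";

# Illustrative helper: list the code points of a character string
# in the form "U+XXXX U+XXXX ...".
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

for my $encoding (qw(iso-8859-1 iso-8859-7 UTF-8 UTF-16BE UTF-16LE)) {
    # Same bytes every time; only the declared encoding changes.
    my $string = decode($encoding, $bytes);
    printf "%-10s %d chars: %s\n",
        $encoding, length($string), codepoints($string);
}
```

Note that decode() does not complain in any of these cases: every one of the five interpretations is "valid" as far as the bytes are concerned.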

Those same four bytes could even be interpreted as a four-byte integer or as a pair of two-byte integers (signed or unsigned, big- or little-endian) -- that is, you could use perl's "unpack" function to get a variety of numeric values from that same byte sequence.
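
As a rough sketch of the numeric readings (the variable names are only for illustration), the following uses unpack templates on the same byte string. The "s>" template with its byte-order modifier needs perl 5.10 or later.

```perl
use strict;
use warnings;

# The same four bytes as before.
my $bytes = "\xce\xa6\xce\xa5";

# One 32-bit unsigned integer: big-endian ("N") and little-endian ("V").
my ($u32_be) = unpack 'N', $bytes;
my ($u32_le) = unpack 'V', $bytes;

# Two 16-bit unsigned integers: big-endian ("n") and little-endian ("v").
my @u16_be = unpack 'n2', $bytes;
my @u16_le = unpack 'v2', $bytes;

# Two 16-bit signed integers, big-endian (byte-order modifier, perl 5.10+).
my @s16_be = unpack 's>2', $bytes;

printf "32-bit unsigned, big-endian:    %u\n", $u32_be;
printf "32-bit unsigned, little-endian: %u\n", $u32_le;
printf "16-bit unsigned, big-endian:    %s\n", join ', ', @u16_be;
printf "16-bit unsigned, little-endian: %s\n", join ', ', @u16_le;
printf "16-bit signed, big-endian:      %s\n", join ', ', @s16_be;
```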

The point is this: there is nothing intrinsic to the byte stream that says "this is utf8 text" or anything else. You have to know what it's supposed to be, treat it accordingly, and handle the cases when there are problems with the data source that cause the data to be something different from what it's supposed to be.
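
One way to handle the "something different from what it's supposed to be" case is to decode strictly and trap the failure. This is only a sketch, assuming the data is supposed to be utf8; decode_utf8_strict is a hypothetical helper name, not something Encode provides.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded character string, or undef when
# the bytes are not well-formed utf8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of quietly
    # substituting U+FFFD; LEAVE_SRC keeps decode() from modifying $bytes.
    my $string = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $string;
}

# First item: Phi, Upsilon as utf8. Second item: the same two letters as
# ISO-8859-7 bytes, which are not a well-formed utf8 sequence.
for my $bytes ("\xce\xa6\xce\xa5", "\xd6\xd5") {
    my $string = decode_utf8_strict($bytes);
    printf "%s: %s\n", unpack('H*', $bytes),
        defined $string ? 'well-formed utf8' : 'not utf8, needs other handling';
}
```

Keep in mind that this check only works in one direction: a failure proves the bytes are not utf8, but a success does not prove that utf8 is what the sender intended.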


Re^10: somethign wrong with the sumbit by Nik (Initiate) on Jan 01, 2008 at 18:39 UTC
Hello Graff, first of all I'd like to wish you a HAPPY NEW YEAR 2008 FOR YOU AND YOUR FAMILY AND ALL THE BEST AT A PERSONAL LEVEL! (I wish the same to all other members too.)

Thank you very much for the detailed explanation of how a byte sequence can be interpreted as a string, or as 1 or 2 numbers, depending on the charset the reader uses. So, before I used Encode::from_to($_, 'ISO-8859-7', 'utf-8') for @display_files, I had to know what encoding the filesystem (XP) used to store filenames. I still have no idea what it is, though.

They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?"

Does that mean that if I assume the fs used 'greek-iso' encoding and I convert it to 'utf8' and it is displayed properly in the web page, I was correct to initially assume the filenames were stored as 'greek-iso'? Is there a chance for the filenames to be in another encoding, then be converted to 'utf8', and still appear properly on screen?

I tried to run the test script you gave me to see if I understood it, but I'm getting "Premature end of script headers: testmenu.pl".

I have read a post of yours many times and I asked for more explanation on it: http://perlmonks.org/?parent=659662;node_id=3333 Please see the latest questions and try to answer them too if you want.

I'm still confused about the nature of the encoding issue. I have 2 versions in mind:

a) Am I right to believe that this whole encoding issue is due to a false assumption that the source encoding of the filenames was greek-iso, and because it wasn't (I guess, I still don't know if they are greek-iso or not) they were wrongly converted to 'utf8', and that is why they don't match after the form submission? But then how come they were displayed correctly in the web page if the source encoding wasn't greek-iso in the first place?

b) Is this whole encoding issue due to the fact that although the filenames were correctly converted to utf8, the client browser's internal form submission function took that string, somehow altered it (God knows how) and returned to index.pl a string consisting of the same chars but in a different encoding?

I'm really striving to understand what's going on here... this encoding issue has been giving me trouble for years, and before I correct it I need to understand it first.
Re^11: somethign wrong with the sumbit by graff (Chancellor) on Jan 01, 2008 at 21:24 UTC
This is one of the longest/deepest dialogs I've had at PM, but you seem to be making progress, so I'm glad. (This last reply of yours does not give any evidence that you tried running the simple cgi script that I posted above, so you have made less progress than I would have liked, but I'll get over it.)

...if I assume the fs used 'greek-iso' encoding and I convert it to 'utf8' and it is displayed properly in the web page, was I correct to initially assume the filenames were stored as 'greek-iso'?

That is called "the scientific method", also known as "the empirical approach", and "programming by experiment". The code is written according to a "hypothesis" about the data, and the results of running the code give you the evidence you need to decide whether the hypothesis was correct. You got that -- bravo!

b) Is this whole encoding issue due to the fact that although the filenames were correctly converted to utf8, the client browser's internal form submission function took that string, somehow altered it (God knows how) and returned to index.pl a string consisting of the same chars but in a different encoding?

It's actually simpler than that. The problem hinges on a false assumption that you made at this point in the current thread -- what you said was:

I believe there is no need to explicitly tell perl to handle param('select') as utf8; it must do this by default, I think.

(sigh) That sort of assumption is worthless until you test it, or find trustworthy documentation that supports or contradicts it. My test cgi script proves that this assumption you made is wrong.

So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.
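
A minimal sketch of what that extra line does, without needing a web server. The assumptions here are mine and not taken from the actual application: the menu entries are held as decoded character strings, the form value arrives as raw utf8 bytes, and $raw_param stands in for whatever param('select') returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the menu entries, held as character strings.
# "\x{03A6}\x{03A5}" is capital Phi followed by capital Upsilon.
my @display_files = ("\x{03A6}\x{03A5}.txt", "notes.txt");

# Stand-in for the submitted form value (assumption: raw utf8 bytes,
# with nothing telling perl what they are supposed to be).
my $raw_param = "\xce\xa6\xce\xa5.txt";

# Comparing the undecoded bytes against the character strings.
my $matches_raw = grep { $_ eq $raw_param } @display_files;

# The explicit step: interpret the parameter bytes as utf8 characters.
my $selected        = decode('UTF-8', $raw_param);
my $matches_decoded = grep { $_ eq $selected } @display_files;

print "matches before decoding: $matches_raw\n";
print "matches after decoding:  $matches_decoded\n";
```

The same principle applies if both sides are kept as bytes instead: what matters is that the two strings being compared have been treated the same way.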
Re^12: somethign wrong with the sumbit by Nik (Initiate) on Jan 01, 2008 at 23:22 UTC
Hello Graff, thank you for going on our longest/deepest journey :)

As I stated in my previous post, I tried to run the test cgi script you gave me to see if I understood it, but I'm getting "Premature end of script headers: testmenu.pl" and hence I can't run it.

Is there a chance for a string to be in some other encoding, then be converted to 'utf8', and still appear properly on screen? I'm sure it can't, because I tested with the filenames and tried to convert them from 'cp1253' to 'utf-8', but then when I tried to display them on the web page they looked like Chinese :)

I would like to ask you, though, what exactly this phrase means: the string "nikos", for example, is encoded in 'greek-iso', or the string "nikos" is encoded in 'utf-8'. I mean, what does it literally mean? Does it signify the practical way this string is going to be stored on the hdd/in memory in terms of bytes? For example, the char 'n' in greek-iso will be 10010000, while the char 'n' in utf-8 will be stored as 10101010 01011010? Which means we are using different bytes in each case for storing?

Does this explain my question above of why, when I tried to read the filenames as cp1253 and convert them to utf8, they never got converted correctly and hence were not displayed correctly? Because I tried to read them in a different manner from the one the filesystem used to store them on the hdd?

So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.

I finally understood your point, and the reason I fell for that wrong assumption was that the statement print header( -charset=>'utf-8' ); led me to believe that perl will handle any incoming/outgoing variable as 'utf-8' chars. But I am guessing this statement only tells the client's browser what encoding to use when displaying stuff and NOT what encoding to use when sending stuff over.

But if the client's browser isn't smart enough to send form strings back to the sending script using the same encoding as the sending script does, HOW am I supposed to add a line to my index.pl telling perl to take the string in an unknown encoding and convert it to 'utf-8'?

`my $select = param('select'); Encode::from_to($select, 'unknown_enc', 'utf-8');`

I mean, I don't know the source encoding the browser used to send the parameter string back to me. How will I detect it? More trouble for the programmer: apart from the fact of different encodings, now he has to detect them too.

It would be so much easier if the browser used the same encoding scheme to return the string as the script that sent the browser the parameter, and of course if perl knew that by default.
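
On the question of what "encoded in greek-iso" versus "encoded in utf-8" means in terms of bytes, here is a small sketch that can be run to see the actual bytes. The helper name hex_bytes and the Greek spelling of the name are chosen only for this illustration. For a string made of plain ASCII letters the two encodings would be expected to produce identical bytes; the difference only shows up for the Greek letters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# The name in ASCII letters, and in Greek letters (nu, iota with tonos,
# kappa, omicron, final sigma), written as code points so that this
# script itself contains only ASCII.
my $latin = "nikos";
my $greek = "\x{03BD}\x{03AF}\x{03BA}\x{03BF}\x{03C2}";

for my $string ($latin, $greek) {
    for my $encoding (qw(iso-8859-7 UTF-8)) {
        # encode() turns characters into the bytes that would be stored
        # on disk or sent over the wire in that encoding.
        printf "%-10s %s\n", $encoding, hex_bytes(encode($encoding, $string));
    }
}
```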
Re^13: somethign wrong with the sumbit by graff (Chancellor) on Jan 02, 2008 at 06:03 UTC
Re^14: somethign wrong with the sumbit by Nik (Initiate) on Jan 02, 2008 at 19:01 UTC
Some notes below your chosen depth have not been shown here