in reply to Re^7: somethign wrong with the sumbit
in thread somethign wrong with the sumbit

No, I didn't know what encoding the filesystem used to store filenames. I assumed 'iso-8859-7' because the filenames used Greek characters.

In my previous post I quoted many other things you said and asked about them so I could understand what's going on, but you didn't answer those, so I still don't have a clear picture of what went wrong and why the values returned from the form don't match any of the values in the array @display_files.

A stream of bytes representing character data can be read as if it were anything at all -- it's just a stream of bytes -- but it's only going to make sense if it is interpreted correctly, according to the intended character encoding.
What do you mean by "can be read as if it were anything at all"? (Sorry, my English isn't that good.) I missed your point.

Re^9: somethign wrong with the sumbit
by graff (Chancellor) on Dec 31, 2007 at 23:27 UTC
    I assumed 'iso-8859-7' because the filenames used Greek characters.

    Programming based on assumptions is just like programming based on beliefs -- not good enough. When you lack reliable or coherent documentation, you need to do tests. They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?" So do the tests, and spend less time asking us to do them for you.

    In my previous post I quoted many other things you said and asked about them so I could understand what's going on, but you didn't answer those, so I still don't have a clear picture of what went wrong and why the values returned from the form don't match any of the values in the array @display_files.

    Did you try to run the little test script that I posted? Save that script as a file in your cgi-bin directory (call it "testmenu.pl" or something like that), make it executable, and point your browser at it. When you see that it works, try to set your own application so it handles the menu strings and parameter values the same way. If it does not work for you, try to be as explicit and clear as possible when you report what actually happens (error messages, web page content); if you made changes in the code before running it (though this should be unnecessary), show the code that you actually ran.

    What do you mean by "can be read as if it were anything at all"

    Let's see if I can explain it better. Here's a sequence of four bytes, expressed as hex numbers:

    ce a6 ce a5
    If you treat those bytes as an ISO-8859-1 string, it's four characters, where the first and third are "capital-I-with-circumflex", the second is "broken-bar", and the fourth is "yen-sign". If you treat it as ISO-8859-7 (Greek), the "a6" byte is still the "broken-bar" char, but "a5" is the "drachma-sign" and "ce" is "Greek-capital-letter-Xi". If interpreted as utf8, it's just two Unicode Greek letters (capital-Phi / U+03A6, capital-Upsilon / U+03A5). If treated as UTF-16BE (big-endian), it's two other Unicode characters: U+CEA6 and U+CEA5 (Hangul syllables); treated as UTF-16LE, it's U+A6CE and U+A5CE, which are unassigned as of this writing (no Unicode characters exist at those code points). Many other non-Unicode character sets could be used to get even more interpretations of the string as two or four characters.
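    The interpretations listed above can be checked with a minimal sketch like the following. It is not part of the original post; it assumes the core Encode module, and the helper name code_points is made up for illustration. It prints only the code points, so no wide characters are written to the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes from the example: ce a6 ce a5
my $bytes = "\xce\xa6\xce\xa5";

# Hypothetical helper: decode the octets under the given encoding and
# return the resulting characters as a list of U+XXXX code points.
sub code_points {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# Same bytes, five different readings.
for my $enc (qw(iso-8859-1 iso-8859-7 UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-10s %s\n", $enc, code_points($enc, $bytes);
}
```

    The byte string never changes between iterations; only the encoding name passed to decode() does, and that alone determines how many characters come out and which ones.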

    Those same four bytes could even be interpreted as a four-byte integer or as a pair of two-byte integers (signed or unsigned, big- or little-endian) -- that is, you could use perl's "unpack" function to get a variety of numeric values from that same byte sequence.
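    As a rough illustration of the numeric readings (again a sketch, not code from the original post), the same byte string can be passed to unpack with different templates. The labels are only descriptive, and the "s>" template assumes perl 5.10 or later.

```perl
use strict;
use warnings;

# The same four bytes as before: ce a6 ce a5
my $bytes = "\xce\xa6\xce\xa5";

# Each entry pairs a description with an unpack template.
my @readings = (
    [ 'one 32-bit unsigned, big-endian'    => 'N'   ],
    [ 'one 32-bit unsigned, little-endian' => 'V'   ],
    [ 'two 16-bit unsigned, big-endian'    => 'n2'  ],
    [ 'two 16-bit unsigned, little-endian' => 'v2'  ],
    [ 'two 16-bit signed, big-endian'      => 's>2' ],
);

for my $reading (@readings) {
    my ($label, $template) = @$reading;
    my @values = unpack $template, $bytes;
    printf "%-36s %s\n", $label, join(', ', @values);
}
```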

    The point is this: there is nothing intrinsic to the byte stream that says "this is utf8 text" or anything else. You have to know what it's supposed to be, treat it accordingly, and handle the cases when there are problems with the data source that cause the data to be something different from what it's supposed to be.

      Hello Graff, first of all I'd like to wish you a HAPPY NEW YEAR 2008 FOR YOU AND YOUR FAMILY AND ALL THE BEST AT A PERSONAL LEVEL! (I wish the same to all other members too.)

      Thank you very much for the detailed explanation of how a byte sequence can be interpreted as a string, or as one or two numbers, depending on the charset the reader uses. So, before I used Encode::from_to($_, 'ISO-8859-7', 'utf-8') for @display_files, I needed to know what encoding the filesystem (XP) used to store filenames. I still have no idea what it is, though.

      They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?"
      Does that mean that if I assume the filesystem used the 'greek-iso' encoding, convert it to 'utf8', and it displays properly in the web page, I was correct to assume initially that the filenames were stored as 'greek-iso'?
      Is there a chance that the filenames are in another encoding, get converted to 'utf8', and still appear properly on screen?

      I tried to run the test script you gave me to see if I understood it, but I'm getting "Premature end of script headers: testmenu.pl".

      I have read a post of yours many times and asked for more explanation of it at http://perlmonks.org/?parent=659662;node_id=3333 -- please see the latest questions there and try to answer them too if you want.

      I'm still confused about the nature of the encoding issue. I have two versions in mind:

      a) Am I right to believe that this whole encoding issue is due to a false assumption that the source encoding of the filenames was greek-iso, and that because it wasn't (I guess; I still don't know whether it is greek-iso or not) they were wrongly converted to 'utf8', which is why they don't match after the form submission? But then how come they were displayed correctly in the web page if the source encoding wasn't greek-iso in the first place?

      b) Is this whole encoding issue due to the fact that, although the filenames were correctly converted to utf8, the client browser's internal form submission function took that string, somehow altered it (God knows how), and returned to index.pl a string consisting of the same chars but in a different encoding?

      I'm really striving to understand what's going on here... This encoding issue has been giving me trouble for years, and before I correct it I need to understand it.

        This is one of the longest/deepest dialogs I've had at PM, but you seem to be making progress, so I'm glad. (This last reply of yours does not give any evidence that you tried running the simple cgi script that I posted above, so you have made less progress than I would have liked, but I'll get over it.)

        ...if I assume the filesystem used the 'greek-iso' encoding, convert it to 'utf8', and it displays properly in the web page, I was correct to assume initially that the filenames were stored as 'greek-iso'?

        That is called "the scientific method", also known as "the empirical approach", and "programming by experiment". The code is written according to a "hypothesis" about the data, and the results of running the code give you the evidence you need to decide whether the hypothesis was correct. You got that -- bravo!

        b) Is this whole encoding issue due to the fact that, although the filenames were correctly converted to utf8, the client browser's internal form submission function took that string, somehow altered it (God knows how), and returned to index.pl a string consisting of the same chars but in a different encoding?

        It's actually simpler than that. The problem hinges on a false assumption that you made at this point in the current thread -- what you said was: "I believe there is no need to explicitly tell perl to handle param('select') as utf8; it must do this by default, I think." (**sigh**) That sort of assumption is worthless until you test it, or find trustworthy documentation that supports or contradicts it. My test cgi script proves that this assumption you made is wrong.

        So to put it clearly: the problem with matching the parameter string from the web form with the original file name is that perl has no way of knowing that the parameter string should be interpreted as a utf8 byte sequence. You need to add a line of code that explicitly tells perl to interpret the parameter string as utf8 characters.
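        A minimal sketch of that idea follows. It is not the test script from earlier in the thread. It assumes that the file names in @display_files are held as decoded perl character strings and that the browser sends the selection back as UTF-8 bytes; the variable $raw_param is a hypothetical stand-in for the value the form handler would return.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the menu entries are character strings (here: Phi, Upsilon).
my @display_files = ("\x{03A6}\x{03A5}");

# Hypothetical stand-in for the raw form value: the same two letters,
# but as the four UTF-8 bytes ce a6 ce a5.
my $raw_param = "\xce\xa6\xce\xa5";

# Without decoding, perl compares two characters against four bytes.
my $matches_raw = grep { $_ eq $raw_param } @display_files;

# The one extra line: tell perl the bytes are UTF-8 encoded characters.
my $param = decode('UTF-8', $raw_param);
my $matches_decoded = grep { $_ eq $param } @display_files;

print "raw bytes matched:      $matches_raw\n";
print "decoded string matched: $matches_decoded\n";
```

        In a real CGI script the raw value would come from the form handler (for example param('select')) rather than from a literal. If the file names were instead kept as UTF-8 bytes, the comparison would have to be done bytes-to-bytes, so it is worth checking which form your own list is in.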