in reply to Re^5: somethign wrong with the sumbit
in thread somethign wrong with the sumbit

Hello Graff, a few things I'd like to clear up about encodings before I try the test script.

The array contains file names that are read from your directory as iso-8859-7 strings,...
How can we be sure that their encoding is 'iso-8859-7', since we don't know what encoding Windows uses to save filenames? How do we know, for example, that the encoding wasn't 'cp1253' or 'utf8'?

And also, can something whose native encoding is one thing be read as another encoding?

...and you convert them to utf8 before putting them into the popup menu,
I didn't want to, but I had to, because otherwise Firefox wouldn't display the filenames as readable Greek text, and I really don't know why... Is it because the printed header was in utf8?
...and you are confident that the strings being returned by the form are being correctly handled as utf8 string.
Well, it seemed the correct thing to believe in. Since the items in the popup menu were 'utf8' after the conversion, wasn't it logical to believe that the item the user selected would also be stored in param('select') and handled as 'utf8'? I mean, if it's a utf8 thing, why wouldn't it be "grabbed" as a utf8 thing and handled as a utf8 thing?
There's a chance that something in the handling of the input parameter string is doing an improper conversion of the original 4-byte sequence into a perl-internal utf8 string. The result of this improper conversion might be an 8-byte string, consisting of:
c3 8e c2 a6 c3 8e c2 a5
Up to this point I understand that utf8 stores 1 Greek char as 2 bytes, and hence 2 chars as 4 bytes, but after that I'm lost...
Which is the "input parameter string"? Do you mean param('select')?
What conversion are you referring to? Why change the 4-byte string into a perl-internal utf8 string?
That's what you get if the original four-byte string is assumed to be non-utf8 (e.g. iso-8859-1) and is then "converted" to utf8 based on that false assumption.
You mean the initial filenames, which were 'iso-8859-7', that I re-encoded to 'utf8' so they would display properly in the browser?

Why is this wrong? The content is still the same (the name of the file); only the way it is stored changes. Sorry for so many questions, but this encoding concept is muddled in my head, and I have to ask you to help me clear it up because I believe we are at the heart of this weird problem.
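
For reference, the 8-byte sequence quoted above can be reproduced with the Encode module. This is only a sketch of the suspected failure, not a diagnosis of the actual script: it assumes the 4 bytes are the utf8 form of the two Greek letters and that some layer wrongly treats them as ISO-8859-1. The helper name hex_bytes is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

# The four bytes that are the utf8 encoding of Greek capital Phi + Upsilon.
my $original = "\xce\xa6\xce\xa5";

# False assumption: every byte is one ISO-8859-1 character (4 characters).
my $misread = decode('iso-8859-1', $original);

# "Converting" those 4 characters to utf8 takes 2 bytes each: 8 bytes total.
my $double_encoded = encode('UTF-8', $misread);

print hex_bytes($original), "\n";
print hex_bytes($double_encoded), "\n";
```

The second line printed is the 8-byte string: the content is no longer "the same name stored differently", because the bytes were interpreted under the wrong encoding before being converted.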

Replies are listed 'Best First'.
Re^7: somethign wrong with the sumbit
by graff (Chancellor) on Dec 30, 2007 at 23:32 UTC
    You said:
    How can we be sure that their encoding is 'iso-8859-7', since we don't know what encoding Windows uses to save filenames? How do we know, for example, that the encoding wasn't 'cp1253' or 'utf8'?
    Dude, I thought you already knew this -- I was just repeating information that I found in your original post at the top of this thread. This is the line in the OP that led me to make that statement:
    Encode::from_to($_, 'ISO-8859-7', 'utf8') for @display_files;

    So you tell me: how can you be sure that the file name encoding is iso-8859-7 on your machine? If you don't know, then you have problems that I probably cannot help you solve.

    And also, can something whose native encoding is one thing be read as another encoding?

    A stream of bytes representing character data can be read as if it were anything at all -- it's just a stream of bytes -- but it's only going to make sense if it is interpreted correctly, according to the intended character encoding.

    it seemed the correct thing to believe in

    So your problem boils down to a tendency towards "faith-based programming". Learn to be a skeptic.

    You need to focus on the advice about doing an experiment. I wanted to make sure that this would work, so I've done the experiment already, and now you can try this yourself to see if it works for you. (It works for me.)

    #!/usr/bin/perl -T
    use strict;
    use CGI;
    use Encode;

    my $c = CGI->new;

    # Menu entries are Perl character strings (two Greek letters each).
    my @menu_choices = ( "\x{03a6}\x{03a5}", "\x{03b2}\x{03c1}" );

    binmode STDOUT, ":utf8";
    print $c->header, $c->start_html;
    print $c->h3("(Display should be readable as utf8)"),
          $c->h3("\x{0395}\x{03c0}\x{03ad}\x{03bb}\x{03b5}\x{03be}\x{03b5}");

    if ( $c->param( 'select' )) {
        # use bytes;
        # my $val = $c->param('select');
        my $val = decode( 'utf8', scalar $c->param( 'select' ));
        my $match = ( grep /^$val$/, @menu_choices ) ? "matches" : "fails to match";
        printf "<P>The value %s received from the form has length %d, and %s.</P>",
            $val, length( $val ), $match;
    }

    print $c->start_form,
          $c->popup_menu( -name => 'select', -values => \@menu_choices ),
          $c->submit( 'ok' ),
          $c->end_form;
    print $c->end_html;
    exit;
    If I comment out the "use Encode" and the line with the "decode()" call, and also uncomment the other two lines ("use bytes" and the simpler assignment to $val), it reports a failure to match, and I'm not sure why that doesn't work. (BTW, I'm using Perl 5.8.8, built for macosx 10.5)

    I also tried $c->start_html( -encoding => 'UTF-8') instead of the default html header, and that did not cause my Safari browser's "default" setting for encoding to do the right thing -- it seems I have to set this browser explicitly for any non-Latin-1 character set. This leads me to suggest that it's a good idea to include a clearly visible ASCII string in your page display, telling the viewers what character encoding they should be using in their browsers, as demonstrated in my test script.

    (I know there should be a way to tell the browser how to do the right thing automatically, and I can't wait to learn about that...)

    update: Found it (duh!):

    print $cgi->header(-charset => 'utf-8'), $cgi->start_html;
    works with both Safari and Firefox. Presumably, if you really wanted to use iso-8859-7 instead of utf8 on your web pages, you would set the "-charset" property for the http header accordingly. But I think you're better off working with utf8.
      No, I didn't know what encoding the filesystem uses to store filenames. I assumed 'iso-8859-7' because the filenames used Greek characters.

      I quoted many other things you said in my previous post and asked about them so I could understand what's going on, but you didn't answer those, so I still don't have a clear picture of what went wrong and why the values returned from the form aren't matching any of the values in the array @display_files.

      A stream of bytes representing character data can be read as if it were anything at all -- it's just a stream of bytes -- but it's only going to make sense if it is interpreted correctly, according to the intended character encoding.
      What do you mean by "can be read as if it were anything at all"? (Sorry, my English isn't that good, and I missed your point.)
        I assumed 'iso-8859-7' because the filenames used Greek characters.

        Programming based on assumptions is just like programming based on beliefs -- not good enough. When you lack reliable or coherent documentation, you need to do tests. They can be simple tests, like "do the file names display correctly on the web page when I use encode() this way?" So do the tests, and spend less time asking us to do them for you.

        I quoted many other things you said in my previous post and asked about them so I could understand what's going on, but you didn't answer those, so I still don't have a clear picture of what went wrong and why the values returned from the form aren't matching any of the values in the array @display_files.

        Did you try to run the little test script that I posted? Save that script as a file in your cgi-bin directory (call it "testmenu.pl" or something like that), make it executable, and point your browser at it. When you see that it works, try to set your own application so it handles the menu strings and parameter values the same way. If it does not work for you, try to be as explicit and clear as possible when you report what actually happens (error messages, web page content); if you made changes in the code before running it (though this should be unnecessary), show the code that you actually ran.

        What do you mean by "can be read as if it were anything at all"

        Let's see if I can explain it better. Here's a sequence of four bytes, expressed as hex numbers:

        ce a6 ce a5
        If you treat those bytes as an ISO-8859-1 string, it's four characters, where the first and third are "capital-I-with-circumflex", the second is "broken-bar", and the fourth is "yen-sign". If you treat it as ISO-8859-7 (Greek), the "a6" byte is still the "broken-bar" char, but "a5" is the "drachma-sign" and "ce" is "Greek-capital-letter-Xi". If interpreted as utf8, it's just two Unicode Greek letters (capital-Phi / U+03A6, capital-Upsilon / U+03A5). If treated as UTF-16BE (big-endian), it's two other Unicode characters: U+CEA6 and U+CEA5 (Hangul syllables); treated as UTF-16LE, it's U+A6CE and U+A5CE, which were unassigned in the Unicode version current when this was posted (later versions place them in the Bamum and Vai blocks, respectively). Many other non-Unicode character sets could be used to get even more interpretations of the string as two or four characters.
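
        The interpretations listed above can be checked with a few lines of Perl. This is a minimal sketch using the core Encode module; the sub name code_points is made up for illustration, and the exact result for iso-8859-7 depends on the mapping table shipped with your version of Encode, so no output is claimed for that one here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xce\xa6\xce\xa5";

# Hypothetical helper: decode the octets under one encoding and
# return the resulting characters as a list of U+XXXX code points.
sub code_points {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

# Same four bytes, five different readings.
for my $enc (qw(iso-8859-1 iso-8859-7 UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-10s %s\n", $enc, code_points($enc, $bytes);
}
```

        The bytes never change between iterations; only the encoding name passed to decode() does, and that alone decides whether you get four characters or two, and which ones.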

        Those same four bytes could even be interpreted as a four-byte integer or as a pair of two-byte integers (signed or unsigned, big- or little-endian) -- that is, you could use perl's "unpack" function to get a variety of numeric values from that same byte sequence.
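
        As a rough illustration of the numeric readings, here is a sketch with unpack. It sticks to the unsigned templates N, V, n and v, which work on older perls such as 5.8, and derives the signed 16-bit values by hand; the variable names are only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bytes = "\xce\xa6\xce\xa5";

my ($u32_be) = unpack 'N',  $bytes;   # one unsigned 32-bit integer, big-endian
my ($u32_le) = unpack 'V',  $bytes;   # one unsigned 32-bit integer, little-endian
my @u16_be   = unpack 'n2', $bytes;   # two unsigned 16-bit integers, big-endian
my @u16_le   = unpack 'v2', $bytes;   # two unsigned 16-bit integers, little-endian

# Signed 16-bit view of the big-endian pair (two's complement).
my @s16_be = map { $_ >= 0x8000 ? $_ - 0x10000 : $_ } @u16_be;

printf "32-bit BE: %u (0x%08x)\n", $u32_be, $u32_be;
printf "32-bit LE: %u (0x%08x)\n", $u32_le, $u32_le;
printf "16-bit BE: %s\n", join ', ', @u16_be;
printf "16-bit LE: %s\n", join ', ', @u16_le;
printf "16-bit BE, signed: %s\n", join ', ', @s16_be;
```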

        The point is this: there is nothing intrinsic to the byte stream that says "this is utf8 text" or anything else. You have to know what it's supposed to be, treat it accordingly, and handle the cases when there are problems with the data source that cause the data to be something different from what it's supposed to be.
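
        One way to handle the "something different from what it's supposed to be" case is to decode strictly and catch the failure, instead of letting bad bytes through silently. The following is a sketch under the assumption that the input is supposed to be utf8; decode_utf8_strict is a made-up name, while decode() and the FB_CROAK check flag come from the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded character string,
# or undef if the octets are not well-formed utf8.
sub decode_utf8_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; eval turns that into undef.
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK()) };
}

my $good = "\xce\xa6\xce\xa5";   # Phi, Upsilon encoded as utf8
my $bad  = "\xd6\xd5";           # the same two letters as iso-8859-7 bytes

for my $octets ($good, $bad) {
    my $chars = decode_utf8_strict($octets);
    if (defined $chars) {
        printf "valid utf8, %d characters\n", length $chars;
    }
    else {
        print "not valid utf8, so it must be treated as something else\n";
    }
}
```

        A check like this only tells you the bytes are well-formed utf8, not that utf8 was the intended encoding, so it complements knowing your data source rather than replacing it.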