Nik has asked for the wisdom of the Perl Monks concerning the following question:

Hello, is there a way to avoid these encoding transformations from Greek ISO to UTF-8 and back again?
If so, please let me know, because on Windows this is very annoying. I believe that XP saves the contents of files in UTF-8 but the filenames in Greek ISO, but I want to hear from you. Thanks.
print start_form( action=>'index.pl' );
print h1( {class=>'lime'}, "&#917;&#960;&#941;&#955;&#949;&#958;&#949; &#964;&#959; &#954;&#949;&#943;&#956;&#949;&#957;&#959; &#960;&#959;&#965; &#963;&#949; &#949;&#957;&#948;&#953;&#945;&#966;&#941;&#961;&#949;&#953; => ",
          popup_menu( -name=>'select', -values=>\@display_files ),
          submit('&#917;&#956;&#966;&#940;&#957;&#953;&#963;&#951;'));
print end_form;

my $passage = param('select') || "&#913;&#961;&#967;&#953;&#954;&#942; &#931;&#949;&#955;&#943;&#948;&#945;!";

Encode::from_to($passage, "utf8", "ISO-8859-7") if param();

if ( param('select') ) {
    open(FILE, "<../data/text/$passage.txt") or die $!;
    local $/;
    $data = <FILE>;
    close(FILE);
    Encode::from_to($passage, "ISO-8859-7", "utf8");

Replies are listed 'Best First'.
Re: Encodings problem
by Joost (Canon) on Oct 07, 2006 at 18:48 UTC
    Hello, is there a way to avoid these encoding transformations from Greek ISO to UTF-8 and back again?

    Erm.. Yeah. Just use UTF8 for everything in the files and 7-bit ASCII for the filenames. It'll save you a lot of trouble. Also, Encode does not do anything with XML/HTML numeric entities, and you shouldn't need them if you "use utf8" in your script.

    And PLEASE try to remember you shouldn't trust remote users' input. i.e. do not trust the user to enter a valid filename. I've had way too many discussions with you about that already.
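One way to act on that, as a minimal sketch: accept the submitted value only if it is one of the names the script itself put in the popup menu. It assumes the script still has that list around (the @display_files from the original post); safe_passage is a made-up helper name, and the sample names are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: return the requested name only if it is one of the
# names the script itself offered; otherwise return nothing (undef).
sub safe_passage {
    my ( $requested, @allowed ) = @_;
    return unless defined $requested;
    my %is_allowed = map { $_ => 1 } @allowed;
    return $is_allowed{$requested} ? $requested : undef;
}

# Stand-in for the list the script builds for popup_menu.
my @display_files = ( 'intro', 'chapter1' );

for my $input ( 'chapter1', '../../etc/passwd', 'chapter1/../secret' ) {
    my $name = safe_passage( $input, @display_files );
    print defined $name ? "ok: $name\n" : "rejected: $input\n";
}
```

Because the check is an exact match against a known list, there is no need to guess which characters (dots, slashes, backslashes) might be dangerous.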

    update: I thought this node seemed familiar. And here, and here. If you're still having problems with the same code after 7 months, please be so kind as to point to earlier posts about it. Especially since your posts are VERY difficult to understand, so every little bit helps.

    update 2: as far as I can see your big problem is NOT that you're using UTF8 encoding for the filenames, it's that you're using XML numeric entities instead of proper UTF8 strings. I.e. don't use "&#255;", use "\x{ff}".
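To illustrate the difference, here is a small sketch. It assumes the script file itself is saved as UTF-8; the Greek letter and the variable names are only examples.

```perl
use strict;
use warnings;
use utf8;                   # the source file is assumed to be saved as UTF-8
use Encode qw(encode);

my $entity  = "&#945;";     # six ASCII characters; Perl sees no Greek here
my $literal = "α";          # one character, GREEK SMALL LETTER ALPHA
my $escaped = "\x{3b1}";    # the same character, written as an escape

printf "entity length:  %d\n", length $entity;
printf "literal length: %d\n", length $literal;
print  "literal eq escaped\n" if $literal eq $escaped;

# Encode the character string to UTF-8 bytes only when it leaves the program.
print encode( "UTF-8", "$literal\n" );
```

An entity only means something to a browser. To Perl, and to Encode, it is just an ampersand, a hash sign and some digits.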

      But i dont use XML numeric entities nowhere. Thsi just happens when i copy/paste from the script to show you here.
      ps. I need the use rto select a filename but many try backware directory traversal trichsk. How do i properly insure that what will i get form the user will be only a filename with an dnot a string that contains dots and backslashes?
      ps2: are you sure that windows is able to save the filanems in pure utf8? if yes my files arent showed up normally but insated i ahve t use the encode function? Iam very confused about this.
        Well, how about you make a small, self-contained program that demonstrates your problem (i.e. reading/writing files with UTF8 filenames; it shouldn't take more than 5 - 10 lines, and I do not mean a CGI program that needs lots of other files: a simple command line program is much better). If that still shows the problem, post it here along with a clear description of the problem, including at least the full program, the expected output and the actual output. You should know the deal by now.

        Also, I realize you're Greek and English isn't your native language, but PLEASE try to spell at least the simple words like "this", "to" and "don't" correctly. A spelling error here or there isn't a problem but this is just very annoying to read.

        Now, one important tip: almost all the time you do not want to use Encode::from_to. You'll want to use Encode::encode (from Perl's internal character strings to bytes in some other encoding) and Encode::decode (from bytes in some encoding to Perl's internal character strings). See the docs.
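A minimal sketch of that decode-then-encode pattern, assuming the input arrives as UTF-8 bytes (as it might from a web form) and the target is ISO-8859-7. The sample bytes and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 web form: the two-byte UTF-8
# sequence for GREEK SMALL LETTER ALPHA (U+03B1).
my $utf8_bytes = "\xCE\xB1";

# decode: bytes in a named encoding -> Perl character string
my $chars = decode( "UTF-8", $utf8_bytes );

# encode: Perl character string -> bytes in a named encoding
my $greek_bytes = encode( "iso-8859-7", $chars );

printf "characters: %d, iso-8859-7 bytes: %s\n",
    length $chars, unpack( "H*", $greek_bytes );
```

Unlike from_to, neither call modifies its argument in place, so the original bytes, the character string and the converted bytes each stay available under their own name.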

Re: Encodings problem
by graff (Chancellor) on Oct 08, 2006 at 00:04 UTC
    I don't understand why you find this to be such an annoyance. Do you have the ability to tell Windows which character encoding it should use when storing file names in directories? (That isn't a Perl question, and I'm not a Windows user, so I don't know.)

    If you can make your Windows system use utf8 for the file names, do that, so that the character encoding of the file names matches the character encoding of your web/cgi data.

    If Windows will only use iso-8859 for file names in Greek, then your choices are limited to:

    1. Do all your web/cgi data in iso-8859, to match the encoding used in file names, or else
    2. Keep the web/cgi content in utf8, and just transliterate file name strings from one encoding to the other when you have to.
    It's really not that big a deal either way, but personally, if I had a lot of web content in utf8 already, and I couldn't get windows to use utf8 for file names, then I think using Encode like you're doing now would be a lot cheaper, easier and quicker than changing all the web content.

    You could just set up a module of your own that implements "utf8-to-iso" versions of "open", "opendir", "readdir" and maybe "glob" -- you could give them names like "gr_open" or whatever, and your cgi scripts then just need to use that module and call those functions instead of the "standard" ones.

    Each function in the module would handle the encoding conversions internally, taking utf8 strings as args and giving back utf8 strings as return values. That way, you don't have to keep rewriting the same encoding conversion code over and over again.
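A rough sketch of what such wrappers might look like follows. Several things here are assumptions rather than anything stated above: that "utf8 strings" means UTF-8 encoded byte strings, that file names on disk are cp1253 bytes, and the helper names _to_fs and _from_fs. The package is inlined so the example runs as a single file; in practice it would live in its own .pm file.

```perl
use strict;
use warnings;

package GreekFile;    # would live in its own GreekFile.pm in practice
use Encode qw(decode encode);

# Assumption: the filesystem hands Perl file names as cp1253 bytes.
our $FS_ENCODING = "cp1253";

# Hypothetical helpers: UTF-8 bytes <-> filesystem bytes.
sub _to_fs   { return encode( $FS_ENCODING, decode( "UTF-8", $_[0] ) ) }
sub _from_fs { return encode( "UTF-8", decode( $FS_ENCODING, $_[0] ) ) }

# Same arguments as three-arg open, but the name is given as UTF-8 bytes.
# $_[0] is used directly so a lexical handle in the caller gets filled in.
sub gr_open {
    my ( undef, $mode, $name ) = @_;
    return open( $_[0], $mode, _to_fs($name) );
}

sub gr_opendir {
    my ( undef, $name ) = @_;
    return opendir( $_[0], _to_fs($name) );
}

# Return all remaining directory entries as UTF-8 bytes (list context only).
sub gr_readdir {
    my ($dh) = @_;
    return map { _from_fs($_) } readdir($dh);
}

package main;

GreekFile::gr_opendir( my $dh, "." ) or die "opendir: $!";
my @names = GreekFile::gr_readdir($dh);
closedir($dh);
print scalar(@names), " entries\n";
```

Characters that have no cp1253 equivalent are replaced by Encode's default substitution character here; a real application would probably want to detect that case and report it instead.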

      Thank you very much.
      It would all be easier if we could just make bloody Windows use "utf8".
      If it's possible and someone knows a way to actually implement this, please let me know; otherwise I will leave it as it is.
        If it's possible and someone knows a way to actually implement this please let me know

        Here's the sort of thing I had in mind -- it's limited but simple, and will trap the most likely problems (but you'll need to figure out what to do in your cgi application when those problems come up). I haven't tested it, except to confirm that it compiles, and to make sure that this sort of operation works as hoped for (at least, it did on macosx):

        my_open( FH, ">", "foo.bar" ) or die "foo.bar: $!";
        #...
        sub my_open {
            my ( $fh, $mode, $name ) = @_;
            open( $fh, $mode, $name );
        }
        Unfortunately, if the caller tries to pass a lexically scoped scalar as the filehandle arg, that doesn't work. There's a way around that, but I haven't tried to look it up. (Maybe other monks know how off the top of their heads.) Since the OP code appears to be using the old UPPERCASE style file handles, the module as provided should do okay.
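One possible way around the lexical filehandle problem, offered as a sketch: the elements of @_ are aliases to the caller's arguments, so calling open on $_[0] itself, rather than on a copy, lets the new handle land in the caller's variable. The sub names below are made up for the comparison, and the null device is opened only so the example has a file that always exists.

```perl
use strict;
use warnings;
use File::Spec;

# Copying @_ first means open() fills in the copy, not the caller's variable.
sub open_via_copy {
    my ( $fh, $mode, $name ) = @_;
    return open( $fh, $mode, $name );
}

# The elements of @_ are aliases to the caller's arguments, so opening on
# $_[0] directly stores the new handle in the caller's lexical.
sub open_via_alias {
    my ( undef, $mode, $name ) = @_;
    return open( $_[0], $mode, $name );
}

my $path = File::Spec->devnull;    # any readable file would do

open_via_copy( my $copied, "<", $path )  or die "$path: $!";
open_via_alias( my $aliased, "<", $path ) or die "$path: $!";

print "copy:  ", ( defined $copied  ? "handle set" : "still undef" ), "\n";
print "alias: ", ( defined $aliased ? "handle set" : "still undef" ), "\n";
close $aliased;
```

Both calls report success, but only the aliasing version leaves the caller holding a usable handle. In the copying version the handle is created in the sub's own variable and closed again when the sub returns.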

        To work this into your cgi apps, store the code as "GreekFile.pm" in one of the @INC paths, and edit your cgi scripts that do file i/o so they include:

        use GreekFile qw/gr_open gr_opendir gr_readdir gr_glob/; # or just the relevant subset of these functions
        Then, wherever you have  open( FH, "<$filename" ) simply change that to  gr_open( FH, "<", $filename ) assuming that $filename is a utf8 string. Similarly for opendir, readdir and glob calls. Just use utf8 strings in your app -- all the conversion to and from CP1253 for file names is handled inside this module.

        Well, I have been using Linux for a while now, but it seemed to me that last time I booted it, it used utf8 as default. I may be mistaken, but I used Japanese, Chinese and French filenames without any problem.

        You may yet find this page interesting.