Persistant Encodings Problems

Nik has asked for the wisdom of the Perl Monks concerning the following question:

Re: Persistant Encodings Problems by Joost (Canon) on Oct 13, 2006 at 10:50 UTC
Sigh Nik - you should know better than this by now...

"These encodings drive me nuts: I create a new txt file inside the 'data/text/path' and if I save it with WordPad as a normal .txt it won't display at all within my embedded javascript inside index.pl"

What's in the new txt file? Which "embedded javascript in index.pl"? There is none in your enclosed code.

"If I save it as utf8 I see funny chars."

Which funny characters?

"I just don't get it. I did the same with all the other text files, so why can't this one be seen normally?"

Which other text files? How do you mean, "normally"?

"The origin of the txt source was from a webpage at google"

Which page? Does this matter?

Why do you assume we would know when you don't provide us with ANY useful information? Does the attached code have anything to do with the problem or do you just attach it to every post you make lately? I can see a few problems, but I've commented about them already. Read the replies to your previous posts.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: Persistant Encodings Problems by Nik (Initiate) on Oct 13, 2006 at 13:40 UTC
You are right. I should be more specific, but I was too tired of hitting this problem over and over, and even the IRC help didn't succeed.

a) The new txt file is a normal txt file which I try to save in Windows as 'utf8', although something tells me it keeps being saved as 'iso-8859-7' or cp1253. The javascript has nothing to do with it. I have the feeling that Windows uses one encoding for file names and a different one for file contents.

b) Characters like ? ?? ??? ? / ??? ? ???

c) The other files are .txt files I created in Windows and then transferred to Linux to transform them from 'what_on_earth_win_encodes_them' => 'utf8'.

d) The last file I had a problem with was created only in XP, and I tried to save it as 'utf8' with WordPad. I believe the contents are saved correctly as utf8, but the filename itself is saved in Greek ISO.

e) The page I retrieved the source from is: http://www.adslgr.com/forum/archive/index.php/t-34485.html

Let's hope this helps, because those bloody encodings and the way Windows saves file names and file contents have been giving me a headache for months now. If you want any other details, please ask me to provide them. My site is http://nikos.no-ip.org; please feel free to do your testing, especially when it comes to selecting entries from the drop-down menu.
Re: Persistant Encodings Problems by graff (Chancellor) on Oct 14, 2006 at 06:19 UTC
The problem seems to be not so much with the encoding as with your attitude toward it. With each new post on this same topic, you seem to be telling us you are not learning what you need to learn, and you appear to have a deliberate resistance to learning.

The OS you are using (the particular version of MS-Windows on your machine) is storing file names in directories using CP1253. Get used to that. Deal with it.

The content inside the files -- that is, the character encoding used for the text data -- might or might not be under your direct control, and might vary from one file to the next. Figure out some minimal diagnostic that you can use on any given data file to work out which encoding it uses. This is not so difficult when the likely alternatives are either utf8 or CP1253.

Use the diagnostic in your cgi script if you have to, but better yet, use it as a separate sanity check on file contents on some regular basis: choose one encoding that would be most convenient in general, and run a stand-alone process on your data files that will check whether the text content uses that encoding. If you find files that use the "wrong" encoding, convert them to the "right" encoding (i.e. whichever one you've chosen as your "standard").

You'll probably be better off if you decide that file contents (text data) should be in utf8; you could use this tool I posted a while back to diagnose the utf8 content of data files, and see what sorts of problems you might be having with that. Other tools are probably available as well -- try googling for "utf8 validation".

You might recall that in one of your previous threads, I suggested a custom module for you (I called it "GreekFile.pm"); if you store that module in the same place as your cgi script, you could write your cgi script like this (at least, based on the parts of the code posted above that made sense):

    use GreekFile qw/gr_glob gr_open/;

    binmode STDERR, ":utf8";   # might help for error reporting
    # ...
    my @files = gr_glob( "../data/text/*.txt" );
    # @files is an array of utf8 strings,
    # because the gr_glob function in GreekFile.pm handles that conversion

    my %display_files;   # let's store file names in a hash
    for ( @files ) {
        my $f = $_;
        $f =~ s/\.txt$//;          # strip the extension for display
        $display_files{$f} = $_;   # use hash keys for display, hash values as file paths
    }
    # ...
    my $passage = param( 'select' ) || "blah blah (in Greek)";
    # $passage is assumed to be in utf8

    if ( exists( $display_files{$passage} )) {
        gr_open( FILE, "<:utf8", $display_files{$passage} )
            or die "$0: $display_files{$passage}: $!";
        local $/;            # slurp mode: read the whole file at once
        my $data = <FILE>;   # $data is assumed to be utf8
        # ... do stuff with $data ...
    }

Notice how the cgi app does not need to worry about encoding the file names to CP1253 -- the "GreekFile" module handles that. (Bear in mind that I did not test the module thoroughly -- I don't have an MS-Windows system, let alone one that stores file names with CP1253 characters.) Also notice the extra care in the error report for "die" -- this makes it easier to look for evidence in the web server's error log. (You do look at the error log, don't you?)

If you are going to reply again that this is too hard and confusing, you might as well get another job.
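
The minimal diagnostic suggested above, for the case where the only likely alternatives are utf8 and CP1253, could be sketched roughly as follows. This is only an illustration under assumptions: the names guess_encoding and to_utf8 are hypothetical (they are not part of GreekFile.pm or of the tool mentioned above), and the sketch relies on the core Encode module. The idea is that data which decodes cleanly as strict UTF-8 is treated as utf8, and anything else is assumed to be CP1253.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 'UTF-8' if the raw bytes decode cleanly
# as strict UTF-8, otherwise assumes 'cp1253' (the only other candidate
# considered in this sketch).
sub guess_encoding {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $bytes.
        decode( 'UTF-8', $bytes, FB_CROAK | LEAVE_SRC );
        1;
    };
    return $ok ? 'UTF-8' : 'cp1253';
}

# Hypothetical helper: returns the same text as UTF-8 encoded bytes,
# converting from CP1253 only when the UTF-8 check fails.
sub to_utf8 {
    my ($bytes) = @_;
    return $bytes if guess_encoding($bytes) eq 'UTF-8';
    return encode( 'UTF-8', decode( 'cp1253', $bytes ) );
}
```

Both functions expect the raw bytes of a file, i.e. data read from a handle that has had binmode applied and has no encoding layer. The check is a heuristic, not proof: pure ASCII passes as utf8 either way, and it only works because Greek text stored as CP1253 is very unlikely to form valid UTF-8 sequences by accident.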