calling the interpreter from a utf8 script

emilioayllon has asked for the wisdom of the Perl Monks concerning the following question:

Re: calling the interpreter from a utf8 script by JamesNC (Chaplain) on Nov 18, 2003 at 23:10 UTC
Update: I was incorrect; you can indeed execute scripts saved in a Unicode encoding. I was able to replicate your error on a Windows box when I saved the script as "Unicode" using Notepad. However, the same script worked fine when I saved it as UTF-8 (using Perl v5.8).

Previous: I don't believe you can encode the script in UTF. You can use Perl to decode (when reading) and encode (when writing) various UTF formats, but your source still needs to be in ASCII, if I read your post correctly.
Re: Re: calling the interpreter from a utf8 script by etcshadow (Priest) on Nov 19, 2003 at 06:07 UTC
Well, "saved in unicode" can mean many different things. utf8 is just one of many encodings of Unicode; there are others. One of the nice things about utf8 is that every character in 7-bit ascii (as usually stored in 8-bit bytes) is encoded exactly the same way in utf8 (I think that's where the term "utf8" originates from... but I could be wrong).

Anyway, looking at Textpad, which is the editor I use when on windows: it has several encoding options, among them "ANSI" (which is the closest it has to ASCII), "UTF-8", and "Unicode". Here is how they encode the character string "asdf":

    [me@host]$ hd ~/asdf.unicode
    0000000   a nul   s nul   d nul   f nul
             61  00  73  00  64  00  66  00
    0000010
    [me@host]$ hd ~/asdf.utf8
    0000000   a   s   d   f
             61  73  64  66
    0000004
    [me@host]$ hd ~/asdf.ascii
    0000000   a   s   d   f
             61  73  64  66
    0000004
    [me@host]$

Get the idea?

------------
:Wq
Not an editor command: Wq
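For anyone without a hex dumper handy, the comparison above can be reproduced in a few lines. This is only a sketch, written in Python for illustration rather than taken from the post, and it assumes that the editor's "Unicode" option means UTF-16 little-endian without a BOM, which is what the dump suggests.

```python
# Encode the same four characters three ways and compare the raw bytes.
text = "asdf"

# The labels are illustrative; "utf-16-le" is an assumption about what
# the editor calls "Unicode".
encodings = {
    "unicode": "utf-16-le",
    "utf8": "utf-8",
    "ascii": "ascii",
}


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


dumps = {label: hex_bytes(text.encode(codec)) for label, codec in encodings.items()}

for label, dump in dumps.items():
    print(f"{label:8} {dump}")
```

The utf8 and ascii rows come out identical, while the UTF-16 row has a zero byte after every character, so a `#!` line saved that way no longer starts with the two bytes `23 21`.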
Re: calling the interpreter from a utf8 script by etcshadow (Priest) on Nov 18, 2003 at 23:49 UTC
Well, the shebang line and the "use utf8" line shouldn't actually contain any characters that utf8 encodes differently than ascii does. (This is one of the nice things about the utf8 encoding: all of the standard 7-bit ascii characters have the exact same single-byte encoding, while other characters have different, and potentially multi-byte, encodings.) The perl interpreter, much like many other applications, will read the source assuming 7-bit ascii or utf8 up to the line that specifies the encoding. It doesn't really matter which, as long as all of the characters fall within the subset that both encode the same way. The only problem you should have is if there are any non-7-bit-ascii characters before the "use utf8" line.

It may be a pain, but can you look at the first couple of lines of your source through a hex-dumper, and see whether your "utf-encoded" and non-"utf-encoded" sources are identical, byte for byte, for those first few lines?

Oh, it could also be a version problem. If your perl interpreter is pre-5.6, then I don't think it recognizes (high-bit-set) utf8.
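The byte-for-byte check suggested here can also be automated. The following is a rough sketch in Python, not anything from the thread: `first_non_ascii` and `header_bytes` are hypothetical helper names, and the sample source is made up. It reports the first byte with the high bit set that appears at or before the "use utf8" line.

```python
def first_non_ascii(data: bytes):
    """Return (offset, byte) of the first byte outside 7-bit ASCII, or None."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset, byte
    return None


def header_bytes(data: bytes, marker: bytes = b"use utf8") -> bytes:
    """Return everything up to and including the line that contains marker."""
    pos = data.find(marker)
    if pos == -1:
        return data  # no marker found: fall back to checking the whole file
    end = data.find(b"\n", pos)
    return data if end == -1 else data[: end + 1]


# Hypothetical sources: one clean, one with an accented character in a
# comment that sits before the "use utf8" line.
clean = b"#!/usr/bin/perl\nuse utf8;\nprint 'caf\xc3\xa9';\n"
dirty = b"#!/usr/bin/perl\n# caf\xc3\xa9\nuse utf8;\n"

print(first_non_ascii(header_bytes(clean)))
print(first_non_ascii(header_bytes(dirty)))
```

In practice you would read the real script with `open(path, "rb").read()` and pass the result to `header_bytes`. A result of `None` means the header is plain 7-bit ascii.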
Re: calling the interpreter from a utf8 script by Anonymous Monk on Nov 19, 2003 at 09:03 UTC
Apache has a problem with interpreting the shebang line if the script is encoded in utf-8. However, when I encode the script in utf8 the error log shows a "premature end of headers" error. You would not get that error if the script were properly utf-8 encoded. The only time I've seen that error is when the encoding was improper (for example, after improperly converting a utf-8 encoded script to ascii).
Re: calling the interpreter from a utf8 script by ysth (Canon) on Nov 18, 2003 at 23:23 UTC
Perl is supposed to be able to interpret various unicode encodings if proper Byte Order Marks appear at the beginning of the script file, but this may conflict with having your server rely on the #! line. Can you tell the server to use perl without having to check the shebang line? Don't know if/how mod_perl will react to BOMs, either.

Update: Is it really utf8 or is it utf16?
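One way to answer the utf8-or-utf16 question is to look at the first bytes of the file for a Byte Order Mark. Here is a small sketch in Python, using the BOM constants from the standard `codecs` module. `sniff_bom` is a hypothetical helper name, and UTF-32 is deliberately left out to keep the example short.

```python
import codecs

# Known BOMs and the encoding each one signals. UTF-32 is omitted here;
# its little-endian BOM begins with the same two bytes as UTF-16LE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# Hypothetical file contents for illustration.
samples = {
    "plain": b"#!/usr/bin/perl\n",
    "utf8 with BOM": codecs.BOM_UTF8 + b"#!/usr/bin/perl\n",
    "utf16 with BOM": codecs.BOM_UTF16_LE + "#!/usr/bin/perl\n".encode("utf-16-le"),
}

for label, data in samples.items():
    print(label, "->", sniff_bom(data))
```

A file with no BOM returns `None`, which says nothing about its encoding; it only means the first bytes are not one of the marks listed.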
Re: Re: calling the interpreter from a utf8 script by emilioayllon (Novice) on Nov 30, 2003 at 21:26 UTC
Hi, the BOM seems to be the problem. Notepad seems to add 3 extra bytes at the beginning of the file, before the shebang line, to identify it as a utf8 file. I am almost sure this is the problem. Will get back with an update.
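If the three extra bytes are the UTF-8 BOM (EF BB BF), one possible workaround is to strip them so that `#!` is once again the very first thing in the file. This is only a sketch in Python under that assumption; `strip_utf8_bom` is a hypothetical helper name, and the sample bytes stand in for a script saved by Notepad.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; leave other input untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Stand-in for a script saved as UTF-8 by an editor that prepends a BOM.
notepad_style = codecs.BOM_UTF8 + b"#!/usr/bin/perl\nuse utf8;\n"
fixed = strip_utf8_bom(notepad_style)

print(notepad_style[:5])  # BOM bytes come before the shebang
print(fixed[:2])          # shebang is back at offset 0
```

To apply it to a real file, read the script in binary mode, pass the bytes through `strip_utf8_bom`, and write the result back in binary mode. Saving from an editor that offers a "UTF-8 without BOM" option should have the same effect.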