Unicode source code problem in 5.6.1

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

Re: Unicode source code problem in 5.6.1 by VSarkiss (Monsignor) on Nov 18, 2002 at 17:36 UTC
Hm, something else must be up. I downloaded the code from the node you mentioned, and it ran OK. (It printed "4"). Here's my Perl version, which is running on Windows 2000: `This is perl, v5.6.1 built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail) Copyright 1987-2001, Larry Wall Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com Built 21:33:05 Jun 17 2002` To make sure I didn't clobber any characters, I used the "D/L code" link rather than copy and paste from the browser window. I did have to remove a stray `my` at the top of the file, but I don't think that's related. Lemme know if you need more details on this installation.
(tye)Re2: Unicode source code problem in 5.6.1 by tye (Sage) on Nov 18, 2002 at 17:47 UTC
Perl Monks uses Latin-1, which means characters outside of that must be encoded as HTML character entities. These don't work inside of CODE tags. So Perl Monks can't properly handle code that isn't in Latin-1, and you can't rely on the "d/l code" link not having done some translation. - tye
Re: Re: Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 20:16 UTC
Re tye's remark: I didn't think about using the HTML entities in the code block. I just pasted the code into the edit box on the form, and it transmitted the UTF-8 bytes, and it looked right when I told the browser to display the page in UTF-8. If you DL the code or copy from the browser source, it should work. If it had been translated to Latin-1 and the characters were actually present, Perl would object to the illegal encoding once `use utf8` was in effect. But neither alpha nor phi is present in Latin-1, so if you didn't get it right it would look really funny. Either way, you'd have noticed. Did you put the strict back in? Commented out, it works. With strict, it does funny things. - John
(tye)Re3: Unicode source code problem in 5.6.1 by tye (Sage) on Nov 18, 2002 at 21:22 UTC
Hmm. I guess that might work much of the time. Of course, the code is displayed incorrectly. When you download the code, you should get the correct byte stream but tagged as Latin-1. If the code is saved in a UTF-8-aware file system (since you are trying to write code in UTF-8), the bytes would be converted from Latin-1 to UTF-8 which would give you different bytes. Even if you save the code using only one-byte characters, translation could happen because the browser knows the operating system expects results in something besides Latin-1, like an OEM encoding (such as "code page 437" in Windows). I'd think that most current "save as" operations would just save bytes and ignore encodings so you'd get the desired byte values. But I wouldn't bet on that. - tye
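The Latin-1 to UTF-8 conversion described above can be sketched in a few lines of Perl. This is only an illustration, and it assumes Perl 5.8 or later, where the Encode module ships with the core (it is not part of 5.6.1). The helper `hex_bytes` is a hypothetical name introduced here for display purposes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The two bytes UTF-8 uses for GREEK SMALL LETTER ALPHA (U+03B1).
my $utf8_alpha = "\xCE\xB1";

# A tool that trusts the "Latin-1" label treats each byte as one character...
my $as_latin1 = decode('ISO-8859-1', $utf8_alpha);

# ...and re-encoding those two characters as UTF-8 doubles the byte count.
my $resaved = encode('UTF-8', $as_latin1);

print hex_bytes($utf8_alpha), "\n";   # CE B1
print hex_bytes($resaved),    "\n";   # C3 8E C2 B1
```

A file rewritten this way no longer contains the character the author typed, which is why a byte-for-byte save matters when passing UTF-8 source through a Latin-1 page.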
Re: (tye)Re3: Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 21:33 UTC
Re (3): Unicode source code problem in 5.6.1 by VSarkiss (Monsignor) on Nov 18, 2002 at 20:46 UTC
This is also kind of a reply to tye's (valid) point. The file doesn't have any HTML entities. Here's the top lines of what `od` says about the file (I have cygwin on the Win 2K machine): `0000000 u s e s t r i c t ; \r \n u s e 0000020 w a r n i n g s ; \r \n u s e 0000040 u t f 8 ; \r \n \r \n m y $ 316 261 = 0000060 5 ; \r \n m y $ 316 246 = 4 ; \r` Note that each variable name shows up as two octal bytes. So I suspect tye's right: I still don't have exactly what John entered, but what I did have worked as expected. Also, I tried it with and without strict. With strict I get the expected: > perl -w ca21hp4a.pl Global symbol "$╬▒" requires explicit package name at ca21hp4a.pl line 5. Execution of ca21hp4a.pl aborted due to compilation errors. I had to use `<pre>` tags instead of `<code>` tags in the above snippet to make those characters show up, although they still got turned into HTML entities. Waah, this encoding stuff is too confusing.
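For what it's worth, the octal pair in the `od` dump and the two box-drawing characters in the error message are consistent with one another. The sketch below is an illustration rather than anything run in this thread, and it assumes Perl 5.8 or later with the core Encode module. It decodes the same two bytes once as UTF-8 and once as code page 437, the OEM encoding tye mentions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octal byte values copied from the od dump for the first variable name.
my @octal = qw(316 261);
my $bytes = pack 'C*', map { oct($_) } @octal;

# Read as UTF-8, the two bytes form a single character.
my $utf8_view = decode('UTF-8', $bytes);
printf "UTF-8: U+%04X\n", ord($utf8_view);    # U+03B1, Greek small alpha

# Read as code page 437, each byte is its own box-drawing character.
my $cp437_view = decode('cp437', $bytes);
my @points = map { sprintf 'U+%04X', ord($_) } split //, $cp437_view;
print "CP437: @points\n";                     # U+256C U+2592
```

U+256C and U+2592 are the "╬" and "▒" shown in the error message, so the bytes in the file appear to be intact UTF-8 and only the console rendering is off. By the same arithmetic, the second name (316 246) is U+03A6, capital phi.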
Re: Re (3): Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 21:04 UTC
Re: Unicode source code problem in 5.6.1 by Thelonius (Priest) on Nov 18, 2002 at 17:47 UTC
Here's a twist. Under perl 5.8.0, compiled for cygwin, your program works fine if and only if `use strict;` is present. On the other hand, if `use strict` is commented out, I get this bizarre error: `"my" variable $strict::VERSION can't be in a package at lib/strict.pm line 93, near "$strict::VERSION " Compilation failed in require at lib/utf8_heavy.pl line 2. BEGIN failed--compilation aborted at lib/utf8_heavy.pl line 2. Compilation failed in require at lib/utf8.pm line 17.`
Re: Re: Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 20:25 UTC
So the polarity reversed in the newer one: it works with strict but fails if not strict? My code was the exact opposite. But the reason for yours is even more bizarre. I suppose they only tested it with strict enabled?
Re: Re: Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 19, 2002 at 15:16 UTC
I'm told this has been fixed in patch 17928.
(tye)Re: Unicode source code problem in 5.6.1 by tye (Sage) on Nov 18, 2002 at 17:41 UTC
Variables whose names begin with control characters are forced into main:: no matter what package you are in. This is how things like ${^TAINT} work (which is a variable named "\ctAINT" -- note that "\ct" is CTRL-T). This sounds like a simple bug where "control character" has been implemented as something like "not ' '..'~'" or "not `/^[a-z_]/i`". Note that this bug does not require 'use utf8' as this code: `use strict; my $ė= 10;` just uses plain 8-bit Latin-1 and results in: `Can't use global $^= in "my", near "my $ė"` which also hints that I'm correct about the source of the bug since it reports the variable name as "$^=". Update: Ah, a different bug with "unusual" variable names. - tye
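The rule about control-character names can be seen with variables Perl already defines. The sketch below is only an illustration and assumes Perl 5.8 or later, where `${^TAINT}` exists and the script is run without `-T`. The package name `Elsewhere` is made up for the example, and `$^W` stands in for any caret-named variable.

```perl
use strict;
use warnings;

# "\cT" is the single control character CTRL-T, code point 20.
printf "ord of \\cT: %d\n", ord("\cT");

package Elsewhere;

# No "our", "my", or "main::" prefix is needed under strict here:
# caret-named variables resolve to package main from any package.
$^W = 1;
my $taint_seen_here = ${^TAINT};   # 0 when taint mode is off

package main;

# The assignment made inside Elsewhere is visible from main.
print "warning flag seen from main: ", $^W, "\n";
print "taint flag read in Elsewhere: ", $taint_seen_here, "\n";
```

If the parser wrongly classifies a high-bit byte as one of these special names, a `my` declaration of it would be rejected in the same way `my $^W` is, which fits the "Can't use global" message quoted above.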
Re: (tye)Re: Unicode source code problem in 5.6.1 by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 20:23 UTC
I suspected a control-character bug, and tried having the variable begin with a regular letter. Same problem. And it's not "can't use a global in my", but a totally different error, which implies that it thinks the variable is being referenced, not defined! I just tried another test, and it's not being forced into package main but can co-exist in other packages. Since it goes away when I'm not strict, it seems like the bug is in recognising a usage before a definition; once past that, it actually works OK.