in reply to Unicode source code problem in 5.6.1

Hm, something else must be up. I downloaded the code from the node you mentioned, and it ran OK. (It printed "4"). Here's my Perl version, which is running on Windows 2000:

This is perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2001, Larry Wall

Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
Built 21:33:05 Jun 17 2002

To make sure I didn't clobber any characters, I used the "D/L code" link rather than copy and paste from the browser window. I did have to remove a stray my at the top of the file, but I don't think that's related.

Lemme know if you need more details on this installation.

(tye)Re2: Unicode source code problem in 5.6.1
by tye (Sage) on Nov 18, 2002 at 17:47 UTC

    Perl Monks uses Latin-1, which means characters outside of it must be encoded as HTML entities. Those don't work inside CODE tags, so Perl Monks can't properly handle code that isn't in Latin-1, and you can't rely on the "d/l code" link not having done some translation.

            - tye
Re: Re: Unicode source code problem in 5.6.1
by John M. Dlugosz (Monsignor) on Nov 18, 2002 at 20:16 UTC
    Re tye's remark: I didn't think about using HTML entities in the code block. I just pasted the code into the edit box on the form, it transmitted the UTF-8 bytes, and the result looked right when I told the browser to display the page as UTF-8. If you download the code or copy from the browser source, it should work. If it had been translated to Latin-1 and the characters were actually present, Perl would object to the illegal encoding once "use utf8" had been issued. But neither alpha nor phi is present in Latin-1, so if you didn't get it right it would look really funny. Either way, you'd have noticed.
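    For anyone who wants to check the Latin-1 point, here is a small sketch in Python (not Perl, and not from the original posts; the variable names are made up for illustration). It shows that neither character can be encoded as Latin-1, and what the numeric-entity fallback tye mentioned would look like.

```python
# Illustrative only: Greek small alpha and capital phi, the two
# variable-name characters discussed in the thread.
chars = ["\u03b1", "\u03a6"]

results = {}
for ch in chars:
    try:
        ch.encode("latin-1")
        in_latin1 = True
    except UnicodeEncodeError:
        in_latin1 = False
    # xmlcharrefreplace substitutes a numeric HTML entity for any
    # character the target encoding cannot represent.
    entity = ch.encode("latin-1", errors="xmlcharrefreplace").decode("ascii")
    results[ch] = (in_latin1, entity)
    print(f"U+{ord(ch):04X} in Latin-1: {in_latin1}, entity form: {entity}")
```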

    Did you put the strict back in? Commented out, it works. With strict, it does funny things.

    —John

      Hmm. I guess that might work much of the time. Of course, the code is displayed incorrectly.

      When you download the code, you should get the correct byte stream but tagged as Latin-1. If the code is saved in a UTF-8-aware file system (since you are trying to write code in UTF-8), the bytes would be converted from Latin-1 to UTF-8 which would give you different bytes. Even if you save the code using only one-byte characters, translation could happen because the browser knows the operating system expects results in something besides Latin-1, like an OEM encoding (such as "code page 437" in Windows).
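      A rough Python sketch of that Latin-1 to UTF-8 conversion, assuming the saving program really does re-encode (which, as said, it may not). The names are hypothetical; the point is only that the two bytes of alpha become four different bytes.

```python
# The bytes the server sends for Greek small alpha, encoded as UTF-8.
original = "\u03b1".encode("utf-8")

# A client that trusts the Latin-1 label reads them as two characters...
misread = original.decode("latin-1")

# ...and a UTF-8-aware save then re-encodes each of those characters.
converted = misread.encode("utf-8")

print(original.hex(" "))   # bytes as sent
print(converted.hex(" "))  # bytes after the unwanted conversion
```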

      I'd think that most current "save as" operations would just save bytes and ignore encodings so you'd get the desired byte values. But I wouldn't bet on that.

              - tye
        I agree, a cut&paste will probably make it worse rather than copy the actual byte stream. But I think that telling IE to display the page as UTF-8 overrides the charset setting. That's what it's supposed to do: take the existing byte stream from the server unchanged, but use the interpretation I specify, since I presumably know better than the page author.

        This presumably changes its mind about the encoding, simply overriding any other way of making the determination. So when I "copy" selected text from the browser window, it knows it's UTF-8 when it copies it to the clipboard and marks it accordingly, or converts it to UTF-16 itself and puts that on the clipboard. That means that a Paste should work properly.

      This is also kind of a reply to tye's (valid) point. The file doesn't have any HTML entities. Here are the top lines of what od says about the file (I have cygwin on the Win 2K machine):

      0000000   u   s   e       s   t   r   i   c   t   ;  \r  \n   u   s   e
      0000020       w   a   r   n   i   n   g   s   ;  \r  \n   u   s   e
      0000040   u   t   f   8   ;  \r  \n  \r  \n   m   y       $ 316 261   =
      0000060       5   ;  \r  \n   m   y       $ 316 246   =       4   ;  \r
      Note that each variable name shows up as two bytes, which od prints in octal. So I suspect tye's right: I still don't have exactly what John entered, but what I did have worked as expected.

      Also, I tried it with and without strict. With strict I get the expected:

      > perl -w ca21hp4a.pl
      Global symbol "$╬▒" requires explicit package name at ca21hp4a.pl line 5.
      Execution of ca21hp4a.pl aborted due to compilation errors.
      
      I had to use <pre> tags instead of <code> tags in the above snippet to make those characters show up, although they still got turned into HTML entities.

      Waah, this encoding stuff is too confusing.

        That's right: the UTF-8 encoding of each of the Greek symbols is two bytes long. The first byte will be (in binary) 110xxxxx and the second one 10xxxxxx, where the x's are the actual code point value, up to 11 bits. In octal, that means a leading 3 and a leading 2, respectively.
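        To make the bit layout concrete, here is a hand-rolled two-byte encoder as a Python sketch. The function name is made up for this post, and it only handles the two-byte range; it is checked against Python's own encoder rather than offered as a general implementation.

```python
def two_byte_utf8(code_point):
    """Encode a code point in U+0080..U+07FF by hand as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not a two-byte UTF-8 code point")
    lead = 0b11000000 | (code_point >> 6)         # 110xxxxx: top 5 bits
    trail = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


octal_forms = {}
for ch in "\u03b1\u03a6":  # alpha and capital phi
    encoded = two_byte_utf8(ord(ch))
    # Cross-check the manual bit-twiddling against the built-in encoder.
    assert encoded == ch.encode("utf-8")
    octal_forms[ch] = " ".join(f"{b:03o}" for b in encoded)
    print(octal_forms[ch], " ".join(f"{b:08b}" for b in encoded))
```

        The octal values it prints match the pairs in the od dump quoted earlier (316 261 and 316 246).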

        By "as expected", you mean the same results I got, not the previously expected results as defined in the docs, right?

        The funny chars in the error message are due to the Console window using a different code page. It's using a DOS-compatible OEM code page, probably 437. You can change that via "MODE CON CP SELECT=1251" to match the GUI, but that won't help here since there is no UTF-8 setting. Redirect it to a file and view with a UTF-8 editor.
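        The code page effect is easy to reproduce. A Python sketch, assuming the console really is on code page 437 as guessed above: take the UTF-8 bytes of the variable name and decode them the way such a console would.

```python
# UTF-8 bytes of the variable name "$" + Greek small alpha,
# as they appear in the error message Perl writes out.
raw = "$\u03b1".encode("utf-8")

# A console on code page 437 shows each byte as its own character.
shown = raw.decode("cp437")
print(shown)  # → $╬▒
```

        That is the same pair of box-drawing characters that showed up in the "Global symbol" message.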