wabbit has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to understand why the PerlIO UTF-8 encoding layer works only in some cases. I'm trying to generate an XML file as UTF-8 and to force all text output to UTF-8. Below are experimental scripts run on MS Windows using ActivePerl v5.8.8. Ultimately this will run on Linux, but it behaves the same on Windows.
# This exports text as UTF8
open(TXT, ">:encoding(utf8)", 'c:\utf8_01.txt');
print TXT <<EOP;
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
EOP
;
close TXT;

# This exports text as Windows 1252
open(TXT, ">:encoding(utf8)", 'c:\utf8_01.txt');
print TXT <<EOP;
xml version='1.0' encoding='UTF-8' standalone='yes' ?>
EOP
;
close TXT;

# This exports text as Windows 1252
open(TXT, ">:encoding(utf8)", 'c:\utf8_01.txt');
print TXT <<EOP;
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
EOP
;
close TXT;

Re: Writing files as UTF-8
by graff (Chancellor) on Feb 19, 2007 at 07:09 UTC
    I'm afraid I don't understand what the problem is, exactly. Your three examples are all identical chunks of code (except for an apparent copy/paste error in the second one: you left out the initial "<?" in the first line of the HEREDOC, in front of "xml").

    Aside from being identical (so there should be no difference at all in the outputs), none of the examples seem to involve any wide (non-ASCII) characters. Since the 128-element ASCII table is a proper subset of both cp1252 and utf8, I would expect that there really is no difference at all in the three outputs -- they are all just plain, simple ASCII.

    There is no discernible difference between utf8 and cp1252 (or iso-8859-*, or even Asian character sets like GBK or Big5) when you're only looking at data that consists entirely of ASCII characters -- all those encodings handle ASCII the same way.
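
    The point about ASCII can be checked directly. The following is a minimal sketch, not code from the original post: the helper name hex_bytes and the two sample strings are made up for illustration. It encodes an ASCII-only string and a string containing one non-ASCII character (U+00E9) with the core Encode module, then compares the resulting bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the thread): show a byte string as hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $ascii = "<?xml version='1.0'?>";   # ASCII only
my $wide  = "caf\x{e9}";               # "cafe" with U+00E9 as the last character

for my $text ($ascii, $wide) {
    my $utf8   = encode('UTF-8',  $text);
    my $cp1252 = encode('cp1252', $text);
    printf "%-8s %s\n", 'UTF-8:',  hex_bytes($utf8);
    printf "%-8s %s\n", 'cp1252:', hex_bytes($cp1252);
    print $utf8 eq $cp1252 ? "identical bytes\n\n" : "different bytes\n\n";
}
```

    For the ASCII-only string the two encodings produce the same bytes. For the second string, U+00E9 becomes the two bytes c3 a9 in UTF-8 but the single byte e9 in cp1252, which is the kind of difference that only shows up once non-ASCII data is involved.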

    Have you tried to output any data with non-ASCII characters? (Which ones, if any?) Show us an example of the code you use to do that, and try to make these points clear for us:

    • Where do the wide characters come from? (Hard-coded in the perl script? Read from some external source?)
    • What exactly shows up in the output file as a result?

    (A hex dump of the file contents would be handy, though sometimes the problem can be obvious based on how the text appears in a common Latin1 or utf8 display environment.)
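
    One way to produce such a hex dump from Perl itself is sketched below. This is an illustration under stated assumptions, not the original poster's code: the temporary file and the sample line containing U+00E9 are invented for the demonstration. The script writes through a UTF-8 output layer, reads the file back as raw bytes, and prints them 16 per row.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; its name is generated, not from the thread.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one line containing a non-ASCII character (U+00E9)
# through the UTF-8 output layer.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} "<name>caf\x{e9}</name>\n";
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes, with no decoding layer applied.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Build a simple dump: offset, then up to 16 bytes per row as hex pairs.
my $dump = '';
for (my $offset = 0; $offset < length($bytes); $offset += 16) {
    my $chunk = substr($bytes, $offset, 16);
    $dump .= sprintf("%08x  %s\n", $offset, join(' ', unpack('(H2)*', $chunk)));
}
print $dump;
```

    If the layer is doing its job, the character after "caf" appears in the dump as the two bytes c3 a9. A lone e9 at that position would instead point to Latin-1 or cp1252 output.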

Re: Writing files as UTF-8
by ady (Deacon) on Feb 19, 2007 at 07:07 UTC