writing UTF-8 files

orionblue3 has asked for the wisdom of the Perl Monks concerning the following question:

I'm sort of new to working with file encoding. I need perl to write a file that is officially "marked" as a UTF-8 file. I tried:
--------------------
use Encode;
open (outf, ">:utf8",'test.txt');
print outf encode("utf8","text");
...
--------------------
and several variations with different functions that I found. But when I open the file in some Windows program (Notepad, etc.) it shows it as an ANSI file, not UTF-8. What am I missing? I saw somewhere that if a string only contains ASCII data then Encode does not set the utf8 flag, and the _utf8_on() described in 'perldoc Encode' doesn't seem to exist.

The party receiving my file is complaining that my file is not officially UTF-8, so I need to figure this out. Please pardon my ignorance if I'm missing something obvious.

Any help would be appreciated. Thanks!

Re: writing UTF-8 files by idsfa (Vicar) on Mar 31, 2005 at 21:48 UTC
Valid ASCII is valid UTF-8. Your receiving party is mistaken.

Updated: They are probably noticing that your file does not have a Byte Order Mark (BOM). This mark is neither required nor recommended for UTF-8, but it is allowed, and many Windows programs add one. If you cannot convince them to read the UTF-8 standard, you may want to look at File::BOM: `open(FH, '>:encoding(UTF-8):via(File::BOM):utf8', $filename)`

April 1 Update: see also UTF-9

The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. -- Cyrus H. Gordon
Re^2: writing UTF-8 files by Joost (Canon) on Mar 31, 2005 at 22:24 UTC
I was under the impression that a BOM is only used for UTF-16 and UTF-32, since UTF-8 has a fixed byte order. Apparently I was mistaken: UTF-8 may start with a mark. This link has some useful information.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: writing UTF-8 files by orionblue3 (Initiate) on Apr 01, 2005 at 14:54 UTC
Thanks, guys, for clearing that up. I ended up going with File::BOM just to appease the guy receiving my file. I definitely appreciate the explanation of the underlying concepts too, since this is the first time I've had to deal with encoding. Thanks!
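
For readers who would rather not depend on File::BOM, the BOM discussed in this sub-thread can also be written by hand. The following is only a minimal sketch, not the method anyone above used: the helper name write_utf8_with_bom is hypothetical, and it assumes the file is written in one go through the standard :encoding(UTF-8) layer, with the character U+FEFF printed first.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): write $text to $path as UTF-8,
# preceded by a byte order mark.
sub write_utf8_with_bom {
    my ($path, $text) = @_;

    # The :encoding(UTF-8) layer converts Perl characters to UTF-8 bytes on output.
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";

    # U+FEFF is the byte order mark; the layer encodes it as the bytes EF BB BF.
    print {$fh} "\x{FEFF}";
    print {$fh} $text;

    close($fh) or die "Cannot close $path: $!";
    return;
}

write_utf8_with_bom('test.txt', "text\n");
```

Note that the string is printed as characters and left to the layer to encode. Calling encode() on it first, as in the original question, while also using a UTF-8 output layer would encode any non-ASCII characters twice.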
Re: writing UTF-8 files by graff (Chancellor) on Apr 01, 2005 at 03:04 UTC
"... the _utf8_on() described in 'perldoc Encode' doesn't seem to exist."

It exists, but is not exported by default; either declare that you want it exported on the "use Encode" line, or else qualify the call with the package name: `Encode::_utf8_on( $string ); # sets utf8 flag on $string`

Likewise for the "is_utf8( $string )" function.

As mentioned previously, ASCII is a proper subset of utf8; a key feature of utf8's design is that every plain ascii text file is, by definition, a working utf8 file. Putting a BOM at the start of ASCII data is silly, but if a text file really does contain wide (non-ascii) unicode characters, which will be 2 to 4 bytes long in utf8, an initial BOM can be sort of a handy signature to put at the start of the file, to give users or apps a "heads up" about what the file contains. (It'll show up as the three-byte sequence "0xEF 0xBB 0xBF", which is the utf8 rendering of the unicode code point U+FEFF.) Still, it is technically unnecessary for utf8 in any case.
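
The two points in this reply can be checked with a short script. This is a sketch only: the variable names are illustrative, and it uses just encode() and is_utf8() from Encode, with is_utf8 requested explicitly on the "use Encode" line because it is not exported by default.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);    # is_utf8 has to be asked for by name

# Plain ASCII text encodes to exactly the same bytes in UTF-8.
my $ascii_bytes = encode('UTF-8', 'text');

# U+FEFF (the BOM) becomes a three-byte sequence in UTF-8.
my $bom_bytes = encode('UTF-8', "\x{FEFF}");
my $bom_hex   = join ' ', map { sprintf '0x%02X', ord($_) } split //, $bom_bytes;

print "ASCII unchanged: ", ($ascii_bytes eq 'text' ? 'yes' : 'no'), "\n";
print "BOM bytes: $bom_hex\n";

# Report whether the encoded byte string carries Perl's internal utf8 flag.
print "utf8 flag on encoded bytes: ", (is_utf8($bom_bytes) ? 'on' : 'off'), "\n";
```

The utf8 flag reported on the last line is a property of a Perl string in memory. It says nothing about how a file on disk is "marked", which is why toggling it with _utf8_on() does not change what Notepad reports.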