Haspalm2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a file saved in Unicode format that I need to read into Perl. I have to add an additional string to the file and then save it again as Unicode. The system is WinXP - could someone help me with guidelines for a workflow + tools to do this? Thanks Martin

Replies are listed 'Best First'.
Re: Unicode2ascii
by shmem (Chancellor) on Nov 28, 2006 at 13:22 UTC
    I guess a good starting point is the Tutorials section, namely perlunitut: Unicode in Perl.

      That's a nice link - could you please tell me: if I save a file in Notepad with the encoding "Unicode", which encoding is that exactly? I ask because there is also an encoding called "UTF-8", and there is a big difference between the two. The files that I would like to open and convert back to ANSI are all saved as Unicode. I hope you can help
        jbert has provided a good link... ;-) From a quick glance, I guess Notepad's "Unicode" means UTF-16LE.

Re: Unicode2ascii
by jbert (Priest) on Nov 28, 2006 at 13:52 UTC
    This unicode in windows node discusses what you need. The tutorials mentioned above are good for UTF-8 on Unix, but don't give a working example for UCS-2 on Windows.

    And now for some additional background:

    Technically, a file isn't in 'unicode'. Unicode is a (large) set of characters and a file is a series of bytes. The way in which you interpret the series of bytes in a file as characters is called an 'encoding'.

    One common encoding is 'UTF-8'. This has the happy property that it is the same as ASCII over the range of ASCII (i.e. 0-127). It represents characters beyond ASCII by using bytes in the 128-255 range, but that alone wouldn't be nearly enough to cover all the characters in Unicode (> 65,000), so a variable-width scheme is used: some characters are encoded as 1 byte (e.g. the ASCII characters), some as 2-byte sequences, some as 3, and so on.
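    You can see the variable width for yourself with the core Encode module; a tiny sketch (the characters are just examples):

        use Encode qw(encode);

        # Plain ASCII stays one byte; accented and symbol characters grow.
        printf "%v02X\n", encode('UTF-8', 'A');          # 41
        printf "%v02X\n", encode('UTF-8', "\x{E9}");     # C3.A9     (two bytes for U+00E9)
        printf "%v02X\n", encode('UTF-8', "\x{20AC}");   # E2.82.AC  (three bytes for U+20AC)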

    The other encoding you're likely to care about is common in the Windows world: UCS-2, a fixed-width encoding in which two bytes represent every character. This is generally what people from a Windows background mean when they say "Unicode string" or "wide character string". Technically it can't cover all of Unicode - the set of Unicode characters has grown past 65,536 since UCS-2 was a good idea - though it does cover nearly everything you're likely to meet. UCS-2 is a fixed-width encoding, like ASCII on a bigger scale. A variable-width encoding based on UCS-2 but allowing full coverage of Unicode is UTF-16, which comes in big- and little-endian variants.
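    To make that concrete, here is a small sketch (again with the core Encode module) showing the fixed two-byte units and the surrogate pair that UTF-16 needs for a character outside the first 65,536:

        use Encode qw(encode);

        # Every character in the basic range is exactly two bytes in UTF-16LE...
        printf "%v02X\n", encode('UTF-16LE', 'A');           # 41.00
        printf "%v02X\n", encode('UTF-16LE', "\x{20AC}");    # AC.20
        # ...but a character beyond U+FFFF needs a surrogate pair (four bytes),
        # which is where UTF-16 differs from plain UCS-2.
        printf "%v02X\n", encode('UTF-16LE', "\x{1D11E}");   # 34.D8.1E.DD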

    This is all very unpleasant and complicated and has vexed technical people for some time.

    However, Perl 5.8 and later has good support for reading and writing files in various Unicode encodings. See perldoc perluniintro and perldoc perlunicode for the Perl docs.

    What you can do is specify additional layers to open (or specify them later with binmode) to tell Perl that you are reading and writing in a particular encoding. If you're on a Unix box this generally means just doing a binmode FH, ':utf8';, but on Windows things seem to be more unpleasant (due to shenanigans with CRLF mappings).
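    For the Unix case, that might look like this (a minimal sketch; the filename is made up):

        # Read a UTF-8 encoded file on a Unix-ish system.
        open(my $fh, '<', 'some_utf8_file.txt') or die "open: $!";
        binmode($fh, ':utf8');          # decode bytes to characters as we read
        while (my $line = <$fh>) {
            # $line now holds decoded characters, not raw bytes
        }
        close $fh;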

    I think the magic that you want is open($fh, "<:raw:encoding(utf16le)", $file) for reading, but at least this post seems to think you want open(my $FH, ">:raw:encoding(UTF16-LE):crlf:utf8", $file) (for writing).
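    Putting that together for the original question (read a Notepad "Unicode" file, add a line, write it back), one possible sketch follows. It assumes the file really is UTF-16LE with a BOM, as Notepad writes it; the filename and the added text are made up:

        use strict;
        use warnings;

        my $file = 'mydata.txt';    # hypothetical filename

        # Read the whole file. :raw strips the default CRLF layer, the encoding
        # layer decodes UTF-16LE, and :crlf then maps CRLF to "\n" on the
        # decoded characters.
        open(my $in, '<:raw:encoding(UTF-16LE):crlf', $file)
            or die "Cannot read $file: $!";
        my @lines = <$in>;
        close $in;

        # Notepad puts a byte order mark (U+FEFF) at the start; drop it so it
        # doesn't end up in the middle of the text later.
        $lines[0] =~ s/^\x{FEFF}// if @lines;

        push @lines, "an additional string\n";

        # Write it back in the same encoding, re-emitting the BOM so Notepad
        # still recognises the file as "Unicode".
        open(my $out, '>:raw:encoding(UTF-16LE):crlf', $file)
            or die "Cannot write $file: $!";
        print {$out} "\x{FEFF}", @lines;
        close $out;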

    Try playing around with some combination of these and report back :-)

    Update: fix some of the more egregious speeling mistooks and bad wording.

      jbert, when you open a file in your editor, how does the editor know whether two bytes next to each other represent 2 separate characters or one "utf8" character?

      For that matter, if you open a file that contains Unicode-encoded characters, how can it tell? If it's just a file full of bytes, wouldn't your editor just try to display each byte as its ASCII representation?

        jbert, when you open a file in your editor, how does the editor know whether two bytes next to each other represent 2 separate characters or one "utf8" character?

        Simple Answer: It doesn't. Depending on the editor, it either needs to be told, requires a specific format, or requires the file to be in the encoding used by the system.

        Complex Answer: Editors can tell the difference between the different Unicode encodings (but not non-Unicode encodings) *if* the file starts with a Byte Order Mark. File::BOM can help you in that case.

        Update: Added to the simple answer.
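        For completeness, a small sketch of the File::BOM route (the module is on CPAN; the filename is made up, and the exact return value is worth double-checking against the module's docs):

            use File::BOM qw(open_bom);

            my $file = 'mystery.txt';   # hypothetical filename

            # open_bom() peeks at any leading BOM, removes it from the stream and
            # pushes a matching :encoding() layer; the third argument is the layer
            # to fall back on when the file has no BOM at all.
            my $encoding = open_bom(my $fh, $file, ':utf8');
            print "Detected encoding: $encoding\n" if $encoding;

            while (my $line = <$fh>) {
                # $line is decoded text, whatever the file's encoding was
            }
            close $fh;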

        In general, it can't. But you may have system-wide defaults/policies/hints.

        On Windows, it guesses, sometimes wrongly. This is the origin of the Notepad bug stories which come up from time to time. (There is a Windows API function, IsTextUnicode, which looks at the byte stream and tries to guess. Notepad calls this function, but it isn't reliable on short, even-length strings of ASCII.)

        It's also a bit more complex than that, because you can write a Byte Order Mark (a two-byte sequence in the UTF-16 encodings) at the beginning of the text stream, which indicates which encoding the following characters are in. But this is in-band signalling, which kind of sucks, because it only really works if you already know the file is Unicode.

        This area is UTF-8's strength. Since ASCII is a strict subset of UTF-8, you can treat a stream of bytes as UTF-8 and everything will be fine if the stream is actually ASCII. As long as the stream is one of those two, you're OK.

        So there are two main camps:

        Windows: We're slowly moving to UCS-2, with two bytes everywhere. People need to guess which encoding is in use.

        Unix: We're moving from ASCII to UTF-8. If your app treats text files as containing UTF-8, it'll work happily with ASCII or UTF-8 files.