The node "unicode in windows" discusses what you need. The tutorials mentioned above are good for UTF-8 on Unix, but they don't give a working example for UCS-2 on Windows.

And now for some additional background:

Technically, a file isn't in 'unicode'. Unicode is a (large) set of characters and a file is a series of bytes. The way in which you interpret the series of bytes in a file as characters is called an 'encoding'.
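To make that concrete, here is a small sketch (not from the original thread) that turns the same four characters into bytes under three encodings, using the core Encode module. The helper name hex_bytes and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes of $text in the named encoding, as hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $text = "caf\x{e9}";    # "cafe" with an acute accent: four characters

# Same characters, three different byte sequences.
for my $encoding ('iso-8859-1', 'UTF-8', 'UTF-16LE') {
    printf "%-10s %s\n", $encoding, hex_bytes($encoding, $text);
}
# iso-8859-1 63 61 66 e9
# UTF-8      63 61 66 c3 a9
# UTF-16LE   63 00 61 00 66 00 e9 00
```

The characters never change; only the byte sequence chosen to represent them does.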

One common encoding is UTF-8. It has the happy property of being identical to ASCII over the ASCII range (i.e. 0-127). It represents everything else using byte values in the 128-255 range. A single byte from that range wouldn't be nearly enough to cover all the characters in Unicode (well over 65,536), so UTF-8 is a variable-width encoding: some characters are represented as 1 byte (the ASCII characters), some as 2-byte sequences, some as 3 bytes, and the rest as 4.
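A quick way to see the variable width is to encode a few characters and count the bytes. This is only a sketch; utf8_length is a made-up helper name and the four code points are arbitrary picks, one from each length class.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes one character needs in UTF-8.
sub utf8_length {
    my ($char) = @_;
    return length( encode('UTF-8', $char) );
}

# 'A', e-acute, the euro sign, and a character outside the first 65,536.
for my $code_point (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X takes %d byte(s) in UTF-8\n",
        $code_point, utf8_length( chr $code_point );
}
# U+0041 takes 1 byte(s) in UTF-8
# U+00E9 takes 2 byte(s) in UTF-8
# U+20AC takes 3 byte(s) in UTF-8
# U+1F600 takes 4 byte(s) in UTF-8
```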

The other encoding you're likely to care about is common in the Windows world. This is UCS-2, a fixed-width encoding in which every character is represented by two bytes, like ASCII on a bigger scale. It is generally what people from a Windows background mean when they say "Unicode string" or "wide character string". Technically it can't cover all of Unicode: two bytes give only 65,536 values, and the set of Unicode characters has grown beyond that since UCS-2 was a good idea. The variable-width encoding that is based on UCS-2 but covers all of Unicode is UTF-16, which represents the extra characters as pairs of two-byte units (surrogate pairs) and comes in big- and little-endian variants.
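The following sketch shows both points at once: a character inside the two-byte range versus one outside it, and the two byte orders. It uses Encode's UTF-16LE and UTF-16BE encodings; hex_bytes is again a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes of $text in the named encoding, as hex.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# U+20AC fits in one two-byte unit; U+1F600 needs a surrogate pair.
for my $code_point (0x20AC, 0x1F600) {
    my $char = chr $code_point;
    printf "U+%04X  LE: %-12s BE: %s\n",
        $code_point,
        hex_bytes('UTF-16LE', $char),
        hex_bytes('UTF-16BE', $char);
}
# U+20AC  LE: ac 20        BE: 20 ac
# U+1F600  LE: 3d d8 00 de  BE: d8 3d de 00
```

Text that stays within the first 65,536 characters comes out byte-for-byte the same in UCS-2 and UTF-16, which is why the two names get used interchangeably on Windows.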

This is all very unpleasant and complicated and has vexed technical people for some time.

However, Perl 5.8 and higher have good support for reading and writing files in various Unicode encodings. See perldoc perluniintro and perldoc perlunicode for the Perl docs.

What you can do is specify additional layers to open (or add them later with binmode) to tell Perl that you are reading and writing in a particular encoding. On a Unix box this generally means just doing binmode FH, ':utf8';, but on Windows things seem to be more unpleasant (due to shenanigans with CRLF mappings).
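A minimal sketch of both ways of attaching a layer, for the UTF-8 case. The temp file and sample text are made up, and I've used the stricter ':encoding(UTF-8)' spelling here rather than ':utf8'; treat that as my choice, not something the thread prescribes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed when the program exits.
my (undef, $path) = tempfile(UNLINK => 1);
my $text = "na\x{ef}ve \x{20ac}5\n";

# Way 1: name the layer in the open mode.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Way 2: open normally, then add the layer with binmode.
open(my $in, '<', $path) or die "Cannot read $path: $!";
binmode($in, ':encoding(UTF-8)');
my $read_back = <$in>;
close $in;

# Compare characters, not bytes: the layers did the conversion for us.
print $read_back eq $text ? "round trip ok\n" : "mismatch\n";
```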

I think the magic you want for reading is open($fh, "<:raw:encoding(UTF-16LE)", $file), and at least this post seems to think that for writing you want open(my $FH, ">:raw:encoding(UTF-16LE):crlf:utf8", $file).
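Here is an untested-on-Windows sketch of the reading half, which sidesteps the CRLF layer question by stripping line endings by hand. It builds its own little UTF-16LE sample file (made-up data, with a byte-order mark and CRLF line ends, as Windows tools often write) and then copies the text out as UTF-8. The function name read_utf16le_lines and the file names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: read a UTF-16LE file into a list of character strings.
# Line endings (CRLF or LF) and a leading byte-order mark are removed by hand.
sub read_utf16le_lines {
    my ($file) = @_;
    open(my $fh, '<:raw:encoding(UTF-16LE)', $file)
        or die "Cannot open $file: $!";
    my @lines = <$fh>;
    close $fh;
    s/\r?\n\z// for @lines;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;
    return @lines;
}

# Build a Windows-style sample: BOM first, CRLF after each line.
my $dir    = tempdir(CLEANUP => 1);
my $sample = "$dir/sample.txt";
open(my $raw, '>:raw', $sample) or die "Cannot write $sample: $!";
print {$raw} encode('UTF-16LE', "\x{FEFF}caf\x{e9}\r\nna\x{ef}ve\r\n");
close $raw or die "Cannot close $sample: $!";

my @lines = read_utf16le_lines($sample);

# Write the same text back out as UTF-8.
my $converted = "$dir/sample.utf8.txt";
open(my $out, '>:encoding(UTF-8)', $converted)
    or die "Cannot write $converted: $!";
print {$out} "$_\n" for @lines;
close $out or die "Cannot close $converted: $!";

printf "read %d lines from the UTF-16LE file\n", scalar @lines;
```

If the layer stack from the post works on your Perl, you can let ':crlf' handle the line endings instead of the substitution above.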

Try playing around with some combination of these and report back :-)

Update: fix some of the more egregious speeling mistooks and bad wording.

In reply to Re: Unicode2ascii by jbert
in thread Unicode2ascii by Haspalm2
