
The most important thing to know about Unicode in Perl is that you are responsible for remembering whether a particular scalar contains "bytes" or "Unicode characters". Perl does not track this for you. If you lose track of it, you're going to have a confusing time.

So to start, if you write a Unicode character in your Perl source file, you have "bytes" (which may contain a valid UTF-8 sequence from your code editor, but it's still "bytes"). If you put use utf8; at the start of your file, then Perl understands that the source is UTF-8 and that you are defining all your strings with "Unicode characters".
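A minimal sketch of that difference, assuming the script itself is saved as UTF-8. The utf8 pragma is lexically scoped, so the same literal can be compared with and without it in one file; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so the "é" below is stored
# on disk as the two bytes 0xC3 0xA9.
my $without = do { no utf8;  "é" };   # literal taken as raw bytes
my $with    = do { use utf8; "é" };   # literal decoded to one character

printf "without use utf8: length %d\n", length $without;
printf "with use utf8:    length %d\n", length $with;
```

Under that assumption the first length is 2 (bytes) and the second is 1 (character).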

To "encode" means to take Unicode characters and convert them to UTF-8 byte sequences, i.e. the input is characters and the output is bytes. To "decode" is the opposite. (You may know that Perl stores Unicode characters internally as UTF-8, but this is an implementation detail and only confuses the issue. Think in terms of characters and bytes.)
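As a rough illustration of the characters-in, bytes-out direction and back, here is a round trip using the core Encode module. The choice of Encode (rather than the utf8:: functions) and the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{100}";                 # one character, U+0100
my $bytes = encode('UTF-8', $chars);   # two bytes, 0xC4 0x80
my $back  = decode('UTF-8', $bytes);   # one character again

printf "chars: %d, bytes: %d, round trip ok: %s\n",
    length $chars, length $bytes, ($back eq $chars ? 'yes' : 'no');
```

Note that length counts characters for the first string and bytes for the second, which is exactly the bookkeeping you have to keep in your head.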

When you perform I/O, including "warn", "print" and "readline", you are reading and writing bytes unless you have put an encoding layer on that file handle with binmode $fh, ":encoding(UTF-8)". If you have a string known to be Unicode characters and you want to write it to a file handle without the encoding layer, you need to utf8::encode($str) it yourself before writing (note that this modifies $str in place).
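The two options can be sketched side by side. In-memory file handles stand in for real files here so the result can be inspected; the buffer and handle names are hypothetical.

```perl
use strict;
use warnings;

my $chars = "\x{100}";   # one Unicode character

# Option 1: an encoding layer turns characters into bytes on output.
my $buf1 = '';
open my $fh1, '>', \$buf1 or die "open: $!";
binmode $fh1, ':encoding(UTF-8)';
print {$fh1} $chars;
close $fh1;

# Option 2: a plain byte handle, so encode a copy by hand first.
my $buf2 = '';
open my $fh2, '>', \$buf2 or die "open: $!";
binmode $fh2;
my $bytes = $chars;
utf8::encode($bytes);    # modifies $bytes in place
print {$fh2} $bytes;
close $fh2;

print "both handles received the same bytes\n" if $buf1 eq $buf2;
```

Printing $chars to the second handle without encoding it would instead trigger a "Wide character" warning, which is usually the first sign that characters and bytes got mixed up.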

When you talk to a database, you should make sure that you know whether you are receiving Unicode characters or bytes for any given field. Most database drivers can be configured to correctly identify text columns (and decode them to Unicode characters) vs. binary columns of bytes.

When you write to a web API, you need to keep track of whether that API expects characters or bytes. It all has to be bytes before it goes over the wire, but sometimes the API does that step for you.
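One concrete case of an API that can do the encoding step for you is the core JSON::PP module, used here only as a stand-in since no particular API is named above. Its utf8 option decides whether encode returns characters or UTF-8 bytes.

```perl
use strict;
use warnings;
use JSON::PP;

my $data = { name => "\x{100}" };

# Without ->utf8 the result is a character string; you still have to
# encode it before it goes over the wire.
my $json_chars = JSON::PP->new->encode($data);

# With ->utf8 the result is already UTF-8 bytes; encoding it again
# would double-encode it.
my $json_bytes = JSON::PP->new->utf8->encode($data);

printf "characters: %d, bytes: %d\n", length $json_chars, length $json_bytes;
```

The byte version is one element longer because U+0100 takes two bytes in UTF-8.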

When in doubt, hex-dump the string to find out whether Perl thinks the character is "\x{100}" or "\xC4\x80". A handy utility for this is B::perlstring, though it outputs bytes in octal rather than hex.

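use strict;
use warnings;
use feature "say";   # "say" is not enabled by default
use B "perlstring";

my $x= "\x{100}";
say perlstring $x;   # characters: "\x{100}"

utf8::encode($x);    # in place: characters become UTF-8 bytes
say perlstring $x;   # bytes, in octal: "\304\200"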
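If you would rather see hex than octal, a small helper along these lines may be easier to compare against "\x{100}" and "\xC4\x80". The name hex_dump is hypothetical; it is not part of B or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: show each element of a string (character or
# byte, whichever the string holds) as a hex number.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf '%X', ord } split //, $str;
}

my $x = "\x{100}";
print hex_dump($x), "\n";   # 100

utf8::encode($x);
print hex_dump($x), "\n";   # C4 80
```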

In reply to Re: Strings with umlauts and such by NERDVANA
in thread Strings with umlauts and such by PeterKaagman
