
Perl doesn't have a preferred character set. The preferred representation for text in a Perl program is a string of code-points, which is independent of any character set. When you export your text data to a file or database, you need to specify how those code-points map to bytes for that file or database. This mapping is called an encoding.
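To make the distinction concrete, here is a minimal sketch (the sample character U+00E9 and the variable names are mine, chosen only for illustration). It shows one code-point producing different bytes under two encodings:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, written by its code-point (U+00E9, e with acute accent).
my $char = "\x{E9}";

# The code-point is the same no matter which encoding is applied later.
my $code_point = ord($char);

# The bytes on disk or on the wire depend on the encoding.
my @utf8_bytes   = map { ord } split //, encode("UTF-8", $char);
my @latin1_bytes = map { ord } split //, encode("ISO-8859-1", $char);

printf "code point:       U+%04X\n", $code_point;
print  "UTF-8 bytes:      @utf8_bytes\n";
print  "ISO-8859-1 bytes: @latin1_bytes\n";
```

The same character is two bytes (195 169) in UTF-8 and one byte (233) in ISO-8859-1, which is why the encoding has to be stated at every boundary.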

Ideally, this is how your program would operate:

1) It reads the UTF-8 byte stream and decodes it into code-points.

2) It encodes the code-points back into UTF-8 for storage into the database.

3) When reading from the database, it decodes the data returned from the database back into code-points.

4) When printing the data to the user, it encodes the code-points via the encoding suitable for display to the user's screen.

So there are a lot of places where the handling of the text can get screwed up. In fact, it is possible that your data is stored correctly in the database, but it is only when you print it out that it doesn't look right. You'll have to debug each step of the process to determine where your text is not being handled correctly.

Here is generally how to handle each of the four situations above:

use Encode;

# case 1 - reading from a file
# :encoding(UTF-8) validates the input; the older :utf8 layer does not
open(my $fh, "<:encoding(UTF-8)", ...); # or use binmode

# case 2 - storing text into a database
$sth = $dbh->prepare("INSERT INTO ... VALUES (?)");
$sth->execute( encode("UTF-8", $text) );

# case 3 - reading from a database
# ($sth here is an executed SELECT statement handle)
my @vals = $sth->fetchrow_array;
@vals = map { decode("UTF-8", $_) } @vals;

# case 4 - writing to a file or STDOUT
binmode STDOUT, ":encoding(UTF-8)";
print $text;
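If you want something you can run without a database, the following is a rough sketch of the same four steps. It assumes in-memory filehandles as stand-ins for the real file and a plain scalar as a stand-in for the database column; the sample string and all variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw UTF-8 bytes for "cafe" with an accented e, standing in for a file's contents.
my $input_bytes = "caf\xC3\xA9";

# Case 1: read through a decoding layer (in-memory handle instead of a real file).
open(my $in, "<:encoding(UTF-8)", \$input_bytes) or die "open for read failed: $!";
my $text = <$in>;
close($in);

# Case 2: encode before handing the text to storage (scalar stands in for the database).
my $stored = encode("UTF-8", $text);

# Case 3: decode what comes back from storage.
my $loaded = decode("UTF-8", $stored);

# Case 4: write through an encoding layer.
my $output_bytes = "";
open(my $out, ">:encoding(UTF-8)", \$output_bytes) or die "open for write failed: $!";
print {$out} $loaded;
close($out);

printf "characters: %d, bytes in: %d, bytes out: %d\n",
    length($text), length($input_bytes), length($output_bytes);
```

Inside the program the string is 4 characters long; outside it is 5 bytes. If those counts ever disagree with what you expect at a given step, that step is a good place to start looking.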

You should also consult your database documentation to see if it's doing any encoding translation under the hood.

A useful routine I've used a lot to debug these problems is:

sub ord_dump {
  join(' ', map { ord($_) } split(//, $_[0]));
}
print ord_dump($text), "\n";
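As an illustration of what the dump tells you, here is a small sketch that runs ord_dump on the same text at three stages. The sample string is mine, and double encoding is only one possible failure, picked because it is easy to recognize in the numbers:

```perl
use strict;
use warnings;
use Encode qw(encode);

sub ord_dump {
    return join(' ', map { ord($_) } split(//, $_[0]));
}

my $decoded = "caf\x{E9}";                 # four code-points
my $once    = encode("UTF-8", $decoded);   # correctly encoded: five bytes
my $twice   = encode("UTF-8", $once);      # bytes mistaken for characters and encoded again

print "decoded:       ", ord_dump($decoded), "\n";
print "encoded once:  ", ord_dump($once),    "\n";
print "encoded twice: ", ord_dump($twice),   "\n";
```

A single 233 means you are looking at decoded text, 195 169 means UTF-8 bytes, and 195 131 194 169 is the signature of text that was encoded twice somewhere along the way.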

In reply to Re: A Character Set Enquiry by pc88mxer
in thread A Character Set Enquiry by Godsrock37
