A Character Set Enquiry

Godsrock37 has asked for the wisdom of the Perl Monks concerning the following question:

Re: A Character Set Enquiry by pc88mxer (Vicar) on Jul 10, 2008 at 21:06 UTC
Perl doesn't have a preferred character set. The preferred representation for text in a Perl program is code points, which are independent of any character set. When you export your text data to a file or database, you need to set up the correct mapping from code points to bytes for that file or database. This mapping is called an encoding.

Ideally, this is how your program would operate:

1) It reads the UTF-8 byte stream and decodes it into code points.
2) It encodes the code points back into UTF-8 for storage in the database.
3) When reading from the database, it decodes the data returned from the database back into code points.
4) When printing the data to the user, it encodes the code points with the encoding suitable for display on the user's screen.

So there are a lot of places where the handling of the text can get screwed up. In fact, it is possible that your data is stored correctly in the database, and that it only looks wrong when you print it out. You'll have to debug each step of the process to determine where your text is not being handled correctly.

Here is generally how to handle each of the four situations above:

`use Encode;

# case 1 - reading from a file
open(F, "<:utf8", ...); # or use binmode

# case 2 - storing text into a database
$sth = $dbh->prepare("INSERT INTO ... VALUES (?)");
$sth->execute( encode("utf8", $text) );

# case 3 - reading from a database
# (fetchrow_array is a statement-handle method, so call it on $sth)
my @vals = $sth->fetchrow_array;
@vals = map { decode("utf8", $_) } @vals;

# case 4 - writing to a file or STDOUT
binmode STDOUT, ":utf8";
print $text;`

You should also consult your database documentation to see if it's doing any encoding translation under the hood.

A useful routine I've used a lot to debug these problems is:

`sub ord_dump { join(' ', map { ord($_) } split(//, $_[0])); }
print ord_dump($text), "\n";`
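The four steps and the ord_dump routine above can be tried without a file or a database. This is only a minimal sketch: the byte string stands in for what a file read would deliver, and the "stored" value stands in for the database column. Both are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ord_dump as in the post above: one decimal number per element of the string.
sub ord_dump { join(' ', map { ord($_) } split(//, $_[0])); }

# Step 1: raw UTF-8 bytes as they would arrive from a file (here U+2126).
my $bytes = "\xe2\x84\xa6";
my $text  = decode('UTF-8', $bytes);      # bytes -> code points

# Steps 2 and 3: encode for storage, then decode what comes back.
my $stored   = encode('UTF-8', $text);    # stand-in for the database value
my $restored = decode('UTF-8', $stored);

print "bytes:    ", ord_dump($bytes),    "\n";   # 226 132 166
print "text:     ", ord_dump($text),     "\n";   # 8486
print "restored: ", ord_dump($restored), "\n";   # 8486

# Step 4: encode once more on output.
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

Three numbers where one is expected means the string is still bytes, which makes it easy to see which step skipped its decode.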
Re^2: A Character Set Enquiry by moritz (Cardinal) on Jul 10, 2008 at 21:21 UTC
"Perl doesn't have a preferred character set."

Not quite true. If you read binary data and then treat it as text data (for example with uc or lc), it's treated as Latin-1.

"In fact, it is possible that your data is stored correctly in the database, but it is only when you print it out that it doesn't look right."

Very unlikely, if he dumped UTF-8 data into a Latin-1 database and then converted it to UTF-8.
Re^3: A Character Set Enquiry by ysth (Canon) on Jul 11, 2008 at 03:38 UTC
By default, arbitrary data with the utf8 flag on will be treated as Unicode characters (equivalent to Latin-1 through code point 255). But by default, without the flag on, it is treated as specified by the C locale, which is pretty much just ASCII. Try it (remove the -CO if you have a non-UTF-8 terminal):

`$ perl -CO -wle'print lc "\xc9"; print lc substr "\x{100}\xc9", 1'`

This outputs É then é.
Re: A Character Set Enquiry by moritz (Cardinal) on Jul 10, 2008 at 21:09 UTC
"What character set does perl use?"

When you read strings in Perl, shuffle them around and don't do much more, Perl treats the strings as binary data. Your Ω in UTF-8 looks like this:

`echo -n "Ω" | hexdump -C
00000000 e2 84 a6`

(The Omega character may not display correctly inside code examples; if you see an HTML escape sequence there, imagine the character in its place.)

When you import that into a Latin1 database, it interprets that as a sequence of Latin1 characters, which is `"āč¦"` in your case. Now you said you converted that to UTF-8. A Latin1 `"\x{e2}"` becomes `c3 a2`, or `ā` as a character.

Now you have to reverse that process step by step. I wish you much patience, and a good read of Encode, perluniintro and perlunicode.

Or, if you have the chance, restore your data from a backup and dump it into a UTF-8 database in the first place.
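The double encoding described above can be reproduced and undone on the three bytes from the hexdump. This is a sketch under the assumption that the database really applied ISO-8859-1 and not a related code page such as Windows-1252; hex_bytes is a helper invented here for display.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join(' ', map { sprintf('%02x', ord($_)) } split(//, $_[0])); }

my $original = "\xe2\x84\xa6";                   # UTF-8 bytes of U+2126

# The database reads each byte as one Latin-1 character ...
my $misread = decode('ISO-8859-1', $original);   # three characters
# ... and the later conversion to UTF-8 encodes each of them again.
my $double  = encode('UTF-8', $misread);

print hex_bytes($original), "\n";   # e2 84 a6
print hex_bytes($double),   "\n";   # c3 a2 c2 84 c2 a6

# Reverse it step by step: UTF-8 -> characters -> Latin-1 bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));
print hex_bytes($repaired), "\n";   # e2 84 a6
```

How the three misread characters are displayed depends on the character set of whatever client shows them, so the glyphs on screen can vary. The byte values do not, which is why comparing hex dumps is the more reliable check.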
Re^2: A Character Set Enquiry by Godsrock37 (Sexton) on Jul 11, 2008 at 12:50 UTC
Thanks everyone for your help... I love this place.

Special thanks to you, moritz, because that makes the most sense, and what you showed as the results of the different encodings is exactly what I get. (I see āč¦ in the database, and when I tried converting it back I got the ā... I'm impressed.) I'm working on a solution now... I think I have it set from here. Something along the lines of converting it from UTF-8 -> Latin 1 -> UTF-8.

Interesting tidbit that might end up mattering... if I use a function to detect the encoding, all of the offending material (crazy characters) is UTF-8 encoded and all the stuff that's just plain text is considered ASCII... but it all went through the same process... what's with that?

The language I'm using now (I have to use it instead of Perl to do some things) has some nice functions for string character set encoding, but it seemed like I was getting nowhere. Now I know where I need to be going. Thanks again...
Re^3: A Character Set Enquiry by moritz (Cardinal) on Jul 13, 2008 at 16:38 UTC
Try something along these lines:

`#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(from_to decode encode);

my $str = '...';
# from_to() converts $str in place and returns the new length in bytes
# (undef on failure), so its return value is not the converted string.
from_to($str, 'UTF-8', 'ISO-8859-1');
my $decoded = decode('UTF-8', $str);
my $finally_utf8 = encode('UTF-8', $decoded);
print $finally_utf8, $/;`

I have no idea if it actually works, but it's worth a try.

"The language I'm using now (have to use it other than perl to do some things) has some nice functions for string character set encoding, but it seemed like i was getting nowhere"

That doesn't surprise me. Encoding guessing relies on characteristics of human language to get it right (for example, every UTF-8 file is also a valid Latin-1 file, but it usually doesn't make much sense to a human), so it is bound to fail if your data contains rubbish encoded into UTF-8.
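One way to see why detection reports "ASCII" for plain text and "UTF-8" for the damaged strings is to test the bytes directly. The sketch below is an assumption about what such a detector roughly checks, not a description of any particular library; is_ascii and is_valid_utf8 are names made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical check: true if the bytes contain only 7-bit ASCII.
sub is_ascii { $_[0] !~ /[^\x00-\x7f]/ ? 1 : 0 }

# Hypothetical check: true if the bytes decode as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;   # work on a copy, decode may modify its argument
    my $ok = eval { decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

my %samples = (
    'plain text'     => "plain text",
    'double-encoded' => "\xc3\xa2\xc2\x84\xc2\xa6",
    'latin-1 only'   => "caf\xe9",
);

for my $name (sort keys %samples) {
    printf "%-15s ascii=%d utf8=%d\n",
        $name, is_ascii($samples{$name}), is_valid_utf8($samples{$name});
}
```

Pure ASCII bytes are identical in ASCII, Latin-1 and UTF-8, so a detector has nothing to distinguish and reports the narrowest match. Double-encoded data is still well-formed UTF-8, which is why a validity check alone cannot tell you that the content is wrong.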
Re^3: A Character Set Enquiry by Godsrock37 (Sexton) on Jul 11, 2008 at 14:18 UTC
Is there a difference between decoding and encoding? It's all so confusing... wouldn't decoding just be the same thing as encoding to the original? In other words: is encoding Latin-1 to UTF-8 the same thing as decoding UTF-8 (to Latin-1)? What's the difference? I'm trying to do some hex examples for myself but I'm having some trouble... sigh... I've decided I hate character sets.
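A small sketch of the distinction being asked about, using the Encode functions already named in this thread. The single Latin-1 byte is an arbitrary sample chosen for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

my $latin1_bytes = "\xe9";    # e-acute as one Latin-1 byte

# decode: bytes in a known encoding -> Perl characters (code points).
my $chars = decode('ISO-8859-1', $latin1_bytes);    # one character, U+00E9

# encode: Perl characters -> bytes in a chosen encoding.
my $utf8_bytes = encode('UTF-8', $chars);           # two bytes, c3 a9

# "Latin-1 to UTF-8" is both steps in a row; from_to does them in place.
my $converted = $latin1_bytes;
from_to($converted, 'ISO-8859-1', 'UTF-8');

printf "%d %d %d\n",
    length($latin1_bytes), length($chars), length($utf8_bytes);   # 1 1 2
print $converted eq $utf8_bytes ? "same\n" : "different\n";       # same
```

So decoding and encoding are opposite directions between bytes and characters, and a conversion from one encoding to another is a decode followed by an encode.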
Re^4: A Character Set Enquiry by massa (Hermit) on Jul 11, 2008 at 16:07 UTC
Re: A Character Set Enquiry by waba (Monk) on Jul 11, 2008 at 17:15 UTC
If your DB is MySQL, you can force it to reinterpret your data as a given charset. This trick may be applicable to other databases, but I have no idea.