in reply to A Character Set Enquiry

What character set does perl use?

When you read strings in perl, shuffle them around and don't do much more, perl treats the strings as binary data.

Your Ω in UTF-8 looks like this:

    echo -n "Ω" | hexdump -C
    00000000  e2 84 a6

(The Omega character doesn't display correctly inside code examples on this site; imagine it there in place of the HTML escape sequence.)
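
The same bytes can be reproduced from Perl. This is only a sketch, and it assumes the character is U+2126 (OHM SIGN), which is what the bytes e2 84 a6 decode to; the Greek capital omega U+03A9 would be ce a9 instead.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the character is U+2126 OHM SIGN, inferred from the hexdump.
my $char  = "\x{2126}";
my $bytes = encode('UTF-8', $char);

# Render each byte as two hex digits, separated by spaces.
my $hex = join ' ', unpack '(H2)*', $bytes;
print "$hex\n";    # prints "e2 84 a6"
```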

When you import that into a Latin1 database, the database interprets those bytes as a sequence of Latin1 characters, which is "âè¦" in your case.

Now you said you converted that to UTF-8. Each of those Latin1 characters is then encoded separately: a Latin1 "\x{e2}" becomes the two bytes c3 a2, or â as a character.
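
To make the double encoding concrete, here is a small sketch that simulates both steps with Encode. It assumes the database really did treat the bytes as ISO-8859-1 and that the later conversion was a plain Latin1-to-UTF-8 re-encode; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "\xe2\x84\xa6";    # the correct UTF-8 bytes from the hexdump

# Step 1: misread the bytes as Latin1, so each byte becomes one character.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Step 2: "convert to UTF-8", which encodes each of those characters again.
my $double = encode('UTF-8', $misread);

my $double_hex = join ' ', unpack '(H2)*', $double;
print "$double_hex\n";    # prints "c3 a2 c2 84 c2 a6"
```

Three bytes have become six, and the first two (c3 a2) are the â mentioned above.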

Now you have to reverse that process step by step. I wish you much patience, and a good read of Encode, perluniintro and perlunicode.

Or if you have the chance, restore your data from a backup, and dump it into a UTF-8 database in the first place.

Re^2: A Character Set Enquiry
by Godsrock37 (Sexton) on Jul 11, 2008 at 12:50 UTC

    Thanks everyone for your help... I love this place

    Special thanks to you, moritz, because that makes the most sense, and what you showed as the results of the different encodings is exactly what I get. (I see âè¦ in the database, and when I tried converting it back I got the â... I'm impressed)

    I'm working on a solution now... I think I have it set from here. Something along the lines of converting it from UTF-8 -> Latin1 -> UTF-8. Interesting tidbit that might end up mattering... if I use a function to detect the encoding, all of the offending material (crazy characters) is detected as UTF-8 and all the stuff that's just plain text is considered ASCII... but it all went through the same process... what's with that?

    The language I'm using now (have to use it other than perl to do some things) has some nice functions for string character set encoding, but it seemed like i was getting nowhere. Now I know where I need to be going. Thanks again...

      Try something along these lines:
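      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode qw(from_to decode encode);

      my $str = '...';
      # from_to converts $str in place and returns the new length in bytes,
      # not the converted string, so its return value is not kept here.
      from_to($str, 'UTF-8', 'ISO-8859-1');
      # $str should now hold the original UTF-8 bytes; decode them to characters.
      my $decoded = decode('UTF-8', $str);
      my $finally_utf8 = encode('UTF-8', $decoded);
      print $finally_utf8, $/;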

      I have no idea if it actually works, but it's worth a try.

      The language I'm using now (have to use it other than perl to do some things) has some nice functions for string character set encoding, but it seemed like i was getting nowhere

      That doesn't surprise me. Encoding guessing relies on characteristics of human language to get it right (for example every UTF-8 file is also a valid Latin-1 file, but it usually doesn't make much sense for a human), so it is bound to fail if your data contains rubbish encoded into UTF-8.
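
      As for why the plain text is still reported as ASCII even though it went through the same process: ASCII bytes are identical in ASCII, Latin1 and UTF-8, so the mis-conversion leaves them untouched. A quick sketch (the helper name mangle is made up for this example, and it assumes the damage was a Latin1 misreading followed by a UTF-8 re-encode):

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode encode);

      # Hypothetical helper: misread bytes as Latin1, then re-encode as UTF-8.
      sub mangle {
          my ($bytes) = @_;
          return encode('UTF-8', decode('ISO-8859-1', $bytes));
      }

      my $ascii     = 'plain text';
      my $non_ascii = "\xe2\x84\xa6";    # UTF-8 bytes from the hexdump above

      # Pure ASCII survives unchanged, so a detector still sees ASCII.
      print mangle($ascii) eq $ascii ? "ASCII unchanged\n" : "ASCII changed\n";

      # Non-ASCII bytes grow from 3 to 6 bytes, but the result is still valid UTF-8.
      print length(mangle($non_ascii)), " bytes after mangling\n";
      ```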

      Is there a difference between decoding and encoding?

      It's all so confusing... wouldn't decoding just be the same thing as encoding to the original? IOW: encoding Latin1 to UTF-8 is the same thing as decoding UTF-8 (to Latin1)? What's the difference?

      I'm trying to do some hex examples for myself but I'm having some trouble... sigh... I've decided I hate character sets

        • encoding means taking a Perl string and transforming it into a binary buffer that represents the same string in some codepage.
        • decoding is the opposite: taking a binary buffer that you know is encoded in some codepage and transforming it into a Perl string.
        Perl strings are arrays of codepoints (Unicode characters) that can be represented internally either as a Latin1-encoded buffer or as a (relaxed) UTF-8 encoded buffer. So, in your examples, there is no "encoding from XXX to XXX" (unless you use Encode::from_to, which is a decode followed by an encode). You can take a binary buffer that you know is in the UTF-8 codepage, decode it (which stores it in Perl's internal format) and then encode that internal string, for instance to the Latin1 codepage.
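
        A minimal sketch of that decode-then-encode round trip, using â as the sample character (the variable names are only illustrative):

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode encode);

        my $utf8_bytes = "\xc3\xa2";                       # binary buffer: UTF-8 for â
        my $string     = decode('UTF-8', $utf8_bytes);     # Perl string: one codepoint, U+00E2
        my $latin1     = encode('ISO-8859-1', $string);    # binary buffer: Latin1 for â

        printf "%d bytes in, %d character, %d byte out\n",
            length($utf8_bytes), length($string), length($latin1);
        # prints "2 bytes in, 1 character, 1 byte out"
        ```

        The middle value is the only one that is a string of characters; the other two are byte buffers in a particular codepage.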
        []s, HTH, Massa