
Thanks everyone for your help... I love this place

Special thanks to you, moritz: your explanation makes the most sense, and the results you showed for the different encodings are exactly what I get. (I see âè¦ in the database, and when I tried converting it back I got the â... I'm impressed.)

I'm working on a solution now... I think I have it set from here. Something along the lines of converting it from UTF-8 -> Latin 1 -> UTF-8. An interesting tidbit that might end up mattering: if I use a function to detect the encoding, all of the offending material (the crazy characters) is reported as UTF-8, and everything that's just plain text is reported as ASCII... but it all went through the same process. What's with that?

The language I'm using now (have to use it other than Perl to do some things) has some nice functions for string character set encoding, but it seemed like I was getting nowhere. Now I know where I need to be going. Thanks again...

Re^3: A Character Set Enquiry
by moritz (Cardinal) on Jul 13, 2008 at 16:38 UTC
    Try something along these lines:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(from_to decode encode);

    my $str = '...';
    # from_to() converts $str in place and returns the new byte length
    from_to($str, 'UTF-8', 'ISO-8859-1');
    my $decoded      = decode('UTF-8', $str);
    my $finally_utf8 = encode('UTF-8', $decoded);
    print $finally_utf8, $/;

    I have no idea if it actually works, but it's worth a try.
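
    To check the idea on data with a known answer, here is a minimal sketch that builds a double-encoded sample and then undoes it. It assumes the damage is exactly one extra round of "UTF-8 bytes read as Latin-1, then encoded to UTF-8 again"; the helper name fix_double_utf8 is made up for this illustration, and the sample character (U+00E8) is not taken from the original data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Encode): undo one round of
# "UTF-8 bytes misread as Latin-1 and encoded to UTF-8 again".
# Takes and returns byte strings.
sub fix_double_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);        # each character now stands for one original byte
    return encode('ISO-8859-1', $chars);        # map those characters back to the original bytes
}

# Build a double-encoded sample from a known character (U+00E8).
my $good   = encode('UTF-8', "\x{e8}");                      # bytes c3 a8
my $broken = encode('UTF-8', decode('ISO-8859-1', $good));   # bytes c3 83 c2 a8

my $fixed = fix_double_utf8($broken);
printf "%s -> %s\n", unpack('H*', $broken), unpack('H*', $fixed);   # c383c2a8 -> c3a8
```

    If the data was mangled some other way (for example through Windows-1252 rather than ISO-8859-1), the second step would need a different encoding name, so treat this as a starting point only.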

    The language I'm using now (have to use it other than Perl to do some things) has some nice functions for string character set encoding, but it seemed like I was getting nowhere

    That doesn't surprise me. Encoding guessing relies on characteristics of human language to get it right (for example every UTF-8 file is also a valid Latin-1 file, but it usually doesn't make much sense for a human), so it is bound to fail if your data contains rubbish encoded into UTF-8.
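
    The point that any UTF-8 byte sequence is also readable as Latin-1 can be seen in a small sketch. It also suggests one likely reason a detector reports untouched-looking text as ASCII: bytes below 0x80 are identical in ASCII, Latin-1 and UTF-8, so the same faulty conversion leaves them unchanged. The helper names double_encode and is_ascii and the sample strings are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: treat bytes as Latin-1 and encode them to UTF-8,
# i.e. the accidental extra encoding step discussed in this thread.
sub double_encode {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

# Hypothetical helper: true if the byte string contains only 7-bit bytes.
sub is_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

my $plain  = "plain text";      # ASCII only
my $accent = "caf\xC3\xA9";     # "cafe" with e-acute, as UTF-8 bytes

for my $sample ($plain, $accent) {
    my $out = double_encode($sample);
    printf "%-22s -> %-26s ascii=%d changed=%d\n",
        unpack('H*', $sample), unpack('H*', $out),
        is_ascii($out), ($out ne $sample ? 1 : 0);
}
```

    The ASCII sample comes out byte-for-byte identical, while the accented one grows from c3 a9 to c3 83 c2 a9, which is still valid UTF-8 and so gives a detector nothing to complain about.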

Re^3: A Character Set Enquiry
by Godsrock37 (Sexton) on Jul 11, 2008 at 14:18 UTC

    Is there a difference between decoding and encoding?

    It's all so confusing... wouldn't decoding just be the same thing as encoding to the original? In other words: is encoding Latin-1 to UTF-8 the same thing as decoding UTF-8 (to Latin-1)? What's the difference?

    I'm trying to do some hex examples for myself but I'm having some trouble... sigh... I've decided I hate character sets

      • encoding means taking a Perl string and transforming it into a binary buffer that represents the same string in some codepage.
      • decoding is the opposite: taking a binary buffer that you know is encoded in some codepage and transforming it into a Perl string.
      Perl strings are arrays of codepoints (Unicode characters) that can be represented internally either as a Latin1-encoded buffer or as a (relaxed) UTF-8 encoded buffer. So, in your examples, there is no "encoding from XXX to XXX" (unless you use Encode::from_to, which is a decode followed by an encode). You can take a buffer that you know is in the UTF-8 codepage, decode it (store it in Perl's internal format), and then encode that internal string, for instance to the Latin1 codepage.
      []s, HTH, Massa
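
      Since hex examples were asked for, here is a small sketch of the decode/encode distinction described above. The sample character (e-acute, U+00E9) and the helper name hex_bytes are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex dump of a byte string, e.g. "c3 a9".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $utf8_bytes   = "\xC3\xA9";                        # e-acute as UTF-8: two bytes
my $chars        = decode('UTF-8', $utf8_bytes);      # bytes -> Perl string (one character)
my $latin1_bytes = encode('ISO-8859-1', $chars);      # Perl string -> bytes (one byte)

printf "UTF-8 bytes  : %s (length %d)\n",   hex_bytes($utf8_bytes),   length $utf8_bytes;
printf "Perl string  : U+%04X (length %d)\n", ord($chars),            length $chars;
printf "Latin-1 bytes: %s (length %d)\n",   hex_bytes($latin1_bytes), length $latin1_bytes;
```

      The middle value is a character string, not a byte buffer: its length is 1 regardless of which codepage the bytes came from or go to, which is why decode and encode are not simply the same operation run in two directions between codepages.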