salonmonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I've got a rather large database which holds a mix of encodings: windows-1250/1251/1254, ISO-8859 and utf8. I'm looking to move the whole database to utf8, but am stuck as to where I should look. Luckily I know the encoding of each record, so it's not a matter of hunt/peck/pray, but I'm not sure how to go about it (the Encode module is also helping check my encoding assertions brilliantly!).

Would any monk care to enlighten me as to how I might go about the conversion, please?

Kindest...


update... ahh maybe, Unicode::MapUTF8 will help!

Replies are listed 'Best First'.
Re: Convert database into UTF-8
by ysth (Canon) on Apr 20, 2005 at 10:43 UTC
    I'm confused. Unicode::MapUTF8, based on a 30-second read of its doc, looks designed to work on pre-5.8 perls. Why would you use it rather than Encode? What do you need to do that Encode isn't doing for you?
Re: Convert database into UTF-8
by inman (Curate) on Apr 20, 2005 at 10:56 UTC
    I have used Encode on a recent project to convert between WinLatin1 and UTF8
      Encode has been working fine for me too, between ISO-8859-1 and UTF-8.

      Wonderful module :-)

        But I think I'm just misunderstanding the documents, or the fundamentals of DBI. I'm not quite sure how one would go about using Encode. Do I use encode('utf8', decode('ENCODING', $database{column}))?
Re: Convert database into UTF-8
by salonmonk (Beadle) on Apr 20, 2005 at 13:08 UTC
    Okay, so it was just my idiocy: encode('utf8', decode('encoding', $data)) works a treat!
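For anyone landing here later, the encode/decode pairing above can be sketched as a small helper. This is only an illustration under stated assumptions: `to_utf8_octets` is a hypothetical name, and the sample byte is made up rather than taken from the poster's database. It decodes with `Encode::FB_CROAK` so that a wrong encoding assertion dies loudly instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: $octets is the raw column value, $encoding is the
# legacy encoding name recorded for that record (e.g. 'cp1251').
sub to_utf8_octets {
    my ($octets, $encoding) = @_;

    # With a CHECK value such as FB_CROAK, decode() may modify its source
    # argument, so work on a copy.
    my $copy = $octets;

    # FB_CROAK makes decode() die on octets that are invalid in $encoding.
    my $chars = decode($encoding, $copy, Encode::FB_CROAK());

    # Turn the character string back into octets, this time as UTF-8.
    return encode('UTF-8', $chars);
}

# Assumed sample: 0xE0 is CYRILLIC SMALL LETTER A (U+0430) in cp1251.
my $utf8 = to_utf8_octets("\xE0", 'cp1251');
printf "%vX\n", $utf8;   # prints the resulting octets in hex, dot-separated
```

Wrapping the call in `eval { ... }` per record would let a batch conversion log the rows whose recorded encoding turns out to be wrong and carry on with the rest.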

    Thanks a load for all your help guys, it's much appreciated.
Re: Convert database into UTF-8
by salonmonk (Beadle) on Apr 20, 2005 at 10:54 UTC
    Yes sorry, I'm just scanning documents hopelessly looking for help. I'm not sure about Encode: when I pull my data from the database, has Perl already converted it into UTF-8 for internal use? Would this mean I need to do no work? Or would I just call decode('windows-1250', $database{column}), meaning the values returned from the decode function would be in UTF-8?

      When DBI pulls text data out of your database, the data will be treated as "bytes", not as characters -- because Perl has no way of knowing what sort of character encoding has been stored in the database.
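A quick way to see why Perl cannot guess: the same octets decode to different characters depending on which encoding you name. This is a minimal sketch with an assumed two-byte sample, not data from the original database.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: two octets as they might come back from a text column.
my $octets = "\xE4\xE0";

# Interpreted as cp1251 these are two Cyrillic letters...
my $as_cp1251 = decode('cp1251', $octets);

# ...but interpreted as ISO-8859-1 they are two accented Latin letters.
my $as_latin1 = decode('iso-8859-1', $octets);

for my $pair ( [ cp1251 => $as_cp1251 ], [ 'iso-8859-1' => $as_latin1 ] ) {
    my ($name, $chars) = @$pair;
    my @codepoints = map { sprintf('U+%04X', ord($_)) } split //, $chars;
    print "$name: @codepoints\n";
}
```

Both results are valid character strings, which is why knowing the encoding of each record matters so much here.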

      So the data coming out of the database is a set of "octets", and needs to be "decoded" into a utf8 string within your perl script. If DBI has given you a hash keyed by column name, then:

      use Encode qw(decode);

      # I'm sure you have a very different (more sensible) way of
      # mapping table values to their proper legacy encodings, but
      # this is just to show how to handle the data:
      my %column_enc_map = (
          columnA => 'cp1250',
          columnB => 'cp1251',
          # or whatever...
      );
      for my $field ( keys %column_enc_map ) {
          # replace the hash values from the database with utf8 strings:
          $database{$field} = decode( $column_enc_map{$field}, $database{$field} );
      }
      # %database values are now in utf8; you can load them back to the database via updates
      Looking at your later reply in this thread, I'm pretty sure you don't need the extra "encode()" step on top of the "decode". All that does is turn off the utf8 flag on the string, which is kind of pointless, I think.
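To make the difference between the two forms concrete, here is a small sketch comparing the result of decode() alone with the result of encode() on top of it. The sample byte is assumed. Which of the two forms should be handed back to DBI depends on how the database driver treats character strings, which this thread does not settle, so treat this as a way to inspect your data rather than a recommendation.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: 0xE0 in cp1251 is a single Cyrillic letter (U+0430).
my $chars  = decode('cp1251', "\xE0");   # Perl character string
my $octets = encode('UTF-8', $chars);    # UTF-8 encoded byte string

# length() counts characters for $chars and bytes for $octets;
# utf8::is_utf8() reports Perl's internal flag for each scalar.
for my $pair ( [ chars => $chars ], [ octets => $octets ] ) {
    my ($name, $value) = @$pair;
    printf "%-6s length=%d flag=%s\n",
        $name, length($value), ( utf8::is_utf8($value) ? 'on' : 'off' );
}
```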
        Wow, brilliant - many thanks for that explanation graff!
      perluniintro - Perl Unicode introduction