ph0enix has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I'm using Tie::RDBM with PostgreSQL to store some data. My problem is that the values obtained back from the database do not have the Unicode (UTF-8) flag set, so they are treated as octets instead of character strings.

Yes, of course I can use $data = decode('utf8', $hash{'key'}) for each item (both keys and values) obtained from the database, but... is there a better way to do this? Is there something like the DBM filters (filter_fetch_key, filter_fetch_value)?

My testing code is here

#!/usr/bin/perl_parallel -w
# For Emacs: -*- mode:cperl; mode:folding; -*-

use strict;
use utf8;
use Tie::RDBM;
use Encode;

my %data;
my $db_name = 'unicode';
my $db_host = 'localhost';
my $db_user = 'elric';
my $db_pass = 'test01';

tie(%data, 'Tie::RDBM', {
    db         => "dbi:Pg:dbname=$db_name;host=$db_host;",
    user       => $db_user,
    password   => $db_pass,
    table      => 'Demo',
    create     => 1,
    drop       => 1,
    autocommit => 1,
    DEBUG      => 0
}) or die $!;

my $counter = 0;
open DATA, '<:utf8', 'input.utf8' or die $!;
while (<DATA>) {
    chomp;
    my ($key, $value) = split(':', $_, 2);
    $data{$key} = $value;

    # BAD - print octets
    print $data{$key}, "\n";

    # OK - print string
    print decode('utf8', $data{$key}), "\n";

    $counter++;
}
close DATA;

print 'number of keys written: ', $counter, "\n";
print 'number of keys in database: ', scalar keys %data, "\n";

# BAD - print octets
print join(', ', keys %data), "\n";

# OK - print strings
print join(', ', map { decode('utf8', $_) } keys %data), "\n";

untie %data;

Thanks for your help

Re: restore unicode data from database?
by graff (Chancellor) on Dec 01, 2002 at 23:12 UTC
    The man page for the "Encode" module in Perl 5.8.0 points out the following, under the heading "The UTF-8 flag", sub-heading "Messing with Perl's internals":
    The following API uses parts of Perl's internals in the current implementation. As such, they are efficient but may change.
    ...
    _utf8_on(STRING)
        [INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is not checked for being well-formed UTF-8. Do not use unless you know that the STRING is well-formed UTF-8. Returns the previous state of the UTF-8 flag (so please don't treat the return value as indicating success or failure), or "undef" if STRING is not a string.
    I'm not saying that this is a good alternative to the work-around you are already using. I have a hunch that anything else that would actually treat the RDBM as a UTF-8 source would require mucking with the DBD or Tie::RDBM module internals, and would be problematic, since a database (and any interface to it) needs to be flexible about handling many types of non-ASCII data -- not just Unicode characters.
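
To make the difference between the two approaches concrete, here is a minimal sketch that applies both decode() and _utf8_on() to the same octets. The sample bytes are made up for illustration and no database is involved; the point is only that decode() validates its input while _utf8_on() merely flips the flag.

```perl
use strict;
use warnings;
use Encode qw(decode _utf8_on);

# Illustrative octets, as they might come back from the database:
# U+00E9 (e with acute accent) encoded as two UTF-8 bytes.
my $octets = "\xC3\xA9";

# Checked conversion: decode() returns a new character string.
my $decoded = decode('UTF-8', $octets);

# Unchecked conversion: flip the UTF-8 flag on a copy, no validation at all.
my $flagged = $octets;
_utf8_on($flagged);

printf "octets:  length %d\n", length $octets;    # 2 (bytes)
printf "decoded: length %d\n", length $decoded;   # 1 (character)
printf "flagged: length %d\n", length $flagged;   # 1 (character)

# With a strict CHECK value, decode() refuses malformed input.
# _utf8_on() would have accepted these bytes silently.
my $bad = "\xFF";
my $ok  = eval { decode('UTF-8', $bad, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };
print "malformed input rejected by decode\n" unless $ok;
```

For well-formed data both calls give the same string; the risk with _utf8_on() only shows up when the stored bytes are not valid UTF-8.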

    Personally, I'd be content with the method you are already using.
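
If calling decode() at every access gets tedious, one way to keep that method but write it only once is a thin tied-hash wrapper around the hash you already have. The following is a sketch under assumptions, not a feature of Tie::RDBM: the package name Tie::DecodeUTF8 is hypothetical (it is not a CPAN module), a plain hash stands in for the Tie::RDBM-tied hash so the example runs without a database, and it assumes keys and values are plain strings (no references).

```perl
package Tie::DecodeUTF8;    # hypothetical name, for illustration only
use strict;
use warnings;
use Encode qw(decode encode);

# Wrap a reference to an existing hash (which may itself be tied).
sub TIEHASH { my ($class, $inner) = @_; return bless { inner => $inner }, $class }

# Helpers: octets -> characters on the way out, characters -> octets on the way in.
sub _out { my ($v) = @_; return defined $v ? decode('UTF-8', $v) : undef }
sub _in  { my ($v) = @_; return defined $v ? encode('UTF-8', $v) : undef }

sub FETCH  { my ($s, $k) = @_; my $v = $s->{inner}{ _in($k) }; return _out($v) }
sub STORE  { my ($s, $k, $v) = @_; $s->{inner}{ _in($k) } = _in($v); return }
sub EXISTS { my ($s, $k) = @_; return exists $s->{inner}{ _in($k) } }
sub DELETE { my ($s, $k) = @_; my $v = delete $s->{inner}{ _in($k) }; return _out($v) }
sub CLEAR  { my ($s) = @_; %{ $s->{inner} } = (); return }

sub FIRSTKEY {
    my ($s) = @_;
    keys %{ $s->{inner} };                        # reset the inner iterator
    my $k = each %{ $s->{inner} };
    return _out($k);
}
sub NEXTKEY { my ($s) = @_; my $k = each %{ $s->{inner} }; return _out($k) }

package main;

my %backend;    # stand-in for the hash tied to Tie::RDBM
tie my %data, 'Tie::DecodeUTF8', \%backend;

$data{"\x{159}eka"} = "\x{17e}lut\x{fd}";         # sample Czech words
my $value = $data{"\x{159}eka"};
my @keys  = keys %data;

print length($value), "\n";                       # 5 characters
print length($keys[0]), "\n";                     # 4 characters
```

In the original script, %backend would be replaced by the %data hash tied to Tie::RDBM, and the rest of the program would read and write through the wrapper instead.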