restore unicode data from database?

ph0enix has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I'm using Tie::RDBM with PostgreSQL to store some data. My problem is that the values obtained back from the database do not have the Unicode (UTF-8) flag set and are treated as octets instead of character strings.

Yes, of course I can use $data = decode('utf8', $hash{'key'}) for each item (both keys and values) obtained from the database, but is there a better way to do this? Is there something like the DBM filters (filter_fetch_key, filter_fetch_value)?

My testing code is here

#!/usr/bin/perl_parallel -w
# For Emacs: -*- mode:cperl; mode:folding; -*-
use strict;
use utf8;
use Tie::RDBM;
use Encode;

my %data;
my $db_name = 'unicode';
my $db_host = 'localhost';
my $db_user = 'elric';
my $db_pass = 'test01';

tie(%data, 'Tie::RDBM', {
 db         => "dbi:Pg:dbname=$db_name;host=$db_host;",
 user       => $db_user,
 password   => $db_pass,
 table      => 'Demo',
 create     => 1,
 drop       => 1,
 autocommit => 1,
 DEBUG      => 0
}) or die $!;

my $counter = 0;
open DATA, '<:utf8', 'input.utf8' or die $!;
while (<DATA>) {
  chomp;
  my ($key, $value) = split(':', $_, 2);
  $data{$key} = $value;
  # BAD - print octets
  print $data{$key}, "\n";
  # OK - print string
  print decode('utf8', $data{$key}), "\n";
  $counter++;
}
close DATA;

print 'number of keys written: ', $counter, "\n";
print 'number of keys in database: ', scalar keys %data, "\n";

# BAD - print octets
print join(', ', keys %data), "\n";

# OK - print strings
print join(', ', map { decode('utf8', $_)} keys %data), "\n";

untie %data;
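
The per-item decode() calls in the script above could, in principle, be collected in one place by putting a thin tied hash in front of the real one. The following is only a sketch under assumptions: DecodingHash is a hypothetical package name (not a CPAN module), %backend is a plain hash standing in for the hash tied to Tie::RDBM, and it assumes the backend stores and returns UTF-8 octets. It has not been tried against Tie::RDBM or DBD::Pg.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package DecodingHash;    # hypothetical name, for illustration only
use Encode qw(encode decode);

# Wrap a reference to another hash (tied or plain) that holds UTF-8 octets.
sub TIEHASH {
    my ($class, $inner) = @_;
    return bless { inner => $inner }, $class;
}

# Encode keys and values to octets on the way in.
sub STORE {
    my ($self, $key, $value) = @_;
    $self->{inner}{ encode('UTF-8', $key) } = encode('UTF-8', $value);
}

# Decode values to character strings on the way out.
sub FETCH {
    my ($self, $key) = @_;
    my $octets = $self->{inner}{ encode('UTF-8', $key) };
    return defined $octets ? decode('UTF-8', $octets) : undef;
}

sub EXISTS {
    my ($self, $key) = @_;
    return exists $self->{inner}{ encode('UTF-8', $key) };
}

sub DELETE {
    my ($self, $key) = @_;
    my $octets = delete $self->{inner}{ encode('UTF-8', $key) };
    return defined $octets ? decode('UTF-8', $octets) : undef;
}

sub CLEAR {
    my ($self) = @_;
    %{ $self->{inner} } = ();
}

# Key iteration: decode each key returned by the inner hash.
sub FIRSTKEY {
    my ($self) = @_;
    keys %{ $self->{inner} };    # void context: resets the inner iterator
    my $key = each %{ $self->{inner} };
    return defined $key ? decode('UTF-8', $key) : undef;
}

sub NEXTKEY {
    my ($self) = @_;
    my $key = each %{ $self->{inner} };
    return defined $key ? decode('UTF-8', $key) : undef;
}

package main;

my %backend;    # stand-in for the hash tied to Tie::RDBM
tie my %data, 'DecodingHash', \%backend;

$data{"k\x{159}"} = "v\x{11b}";    # one non-ASCII character in key and value

binmode STDOUT, ':encoding(UTF-8)';
print 'value length in characters: ', length($data{"k\x{159}"}), "\n";
print 'keys: ', join(', ', keys %data), "\n";
```

Whether the real backend expects octets or character strings on store depends on the driver and its settings, so the encode() calls in STORE may need to be dropped or adjusted.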

Thanks for your help


Replies are listed 'Best First'.
Re: restore unicode data from database?
by graff (Chancellor) on Dec 01, 2002 at 23:12 UTC

The man page for the "Encode" module in Perl 5.8.0 points out the following, under the heading "The UTF-8 flag", sub-heading "Messing with Perl's internals":

         The following API uses parts of Perl's internals in the
         current implementation.  As such, they are efficient but
         may change.

         ...

         _utf8_on(STRING)
           [INTERNAL] Turns on the UTF-8 flag in STRING.  The
           data in STRING is not checked for being well-formed
           UTF-8.  Do not use unless you know that the STRING is
           well-formed UTF-8.  Returns the previous state of the
           UTF-8 flag (so please don't treat the return value as
           indicating success or failure), or "undef" if STRING
           is not a string.

I'm not saying that this is a good alternative to the work-around that you are already using. I have a hunch that anything else that would actually treat the RDBM as a utf-8 source would require mucking with the DBD or Tie::RDBM module internals, and would be problematic, since a database (and any interface to it) needs to be flexible about handling many types of non-ASCII data -- not just unicode characters.

Personally, I'd be content with the method you are already using.
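
To make the trade-off concrete, here is a small sketch comparing decode() with _utf8_on() on the same octets. The sample bytes and variable names are made up for illustration. It also shows utf8::decode, a built-in that decodes in place and reports failure, as a checked alternative to flipping the flag by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+0159.
my $octets = "\xC5\x99";

# decode() validates the input and returns a character string.
my $chars = decode('UTF-8', $octets);

# _utf8_on() only flips the internal flag; nothing is validated.
my $flagged = $octets;
Encode::_utf8_on($flagged);

# utf8::decode() converts in place and returns false on malformed input.
my $inplace = $octets;
my $decoded_ok = utf8::decode($inplace);

# With FB_CROAK, decode() dies on malformed input instead of passing it on.
my $bad = "abc\xFF";    # 0xFF never occurs in well-formed UTF-8
my $ok  = eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

print 'octets length: ', length($octets), "\n";
print 'chars length:  ', length($chars),  "\n";
print 'flagged equals decoded: ', ($flagged eq $chars ? 'yes' : 'no'), "\n";
print 'malformed input accepted by strict decode: ', ($ok ? 'yes' : 'no'), "\n";
```

For well-formed data all three approaches yield the same string; the difference is only in what happens when the data is not valid UTF-8.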
