BerkeleyDB + UTF8

tfoertsch has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am facing a problem with storing UTF-8 keys via Perl's BerkeleyDB module. On the command line the code works as expected:

$ perl -Mstrict -Mutf8 -MBerkeleyDB -MEncode -MData::Dumper -le '
  unlink "xx.db";
  tie my %h, "BerkeleyDB::Btree", -Filename=>"xx.db", -Flags=>DB_CREATE;
  my $db=tied %h; $Data::Dumper::Useqq=1;
  $db->filter_fetch_key(
    sub {
      warn ">>fetch: ".Dumper($_);
      $_=decode("utf8", $_);
      warn "<<fetch: ".Dumper($_);
    });
  $db->filter_store_key(
    sub {
      warn ">>store: ".Dumper($_);
      $_=encode("utf8", $_);
      warn "<<store: ".Dumper($_);
    });
  $h{"ä"}=1;
  my @l=keys %h'
>>store: $VAR1 = "\x{e4}";
<<store: $VAR1 = "\303\244";
>>fetch: $VAR1 = "\303\244";
<<fetch: $VAR1 = "\x{e4}";

The store filter gets a utf8 string and converts it into a byte string. The fetch filter gets a byte string and internalizes it.
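For reference, the round trip that the two filters are meant to perform can be sketched without BerkeleyDB at all. This is only an illustration using Encode and the core `utf8::is_utf8` function; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: one character, U+00E4 ("ä").
my $key = "\x{e4}";

# What the store filter does: characters -> UTF-8 octets.
my $bytes = encode("utf8", $key);

# What the fetch filter does: UTF-8 octets -> characters.
my $back = decode("utf8", $bytes);

printf "key:   %d char(s)\n", length $key;
printf "bytes: %d octet(s), UTF8 flag %s\n",
    length $bytes, utf8::is_utf8($bytes) ? "on" : "off";
printf "round trip ok: %s\n", $back eq $key ? "yes" : "no";
```

The encoded key is a plain byte string with the UTF8 flag off, which is what the fetch filter expects to receive.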

When I use the same filters in my program I get this output:

While filling the database (two keys are stored, "äü" and "ää"):

>>store_key: $VAR1 = "\x{e4}\x{fc}";
<<store_key: $VAR1 = "\303\244\303\274";
>>store_key: $VAR1 = "\x{e4}\x{e4}";
<<store_key: $VAR1 = "\303\244\303\244";

But reading back fails:

>>store_key: $VAR1 = "\x{e4}";
<<store_key: $VAR1 = "\303\244";
>>fetch_key: $VAR1 = "\x{e4}\x{e4}";
<<fetch_key: $VAR1 = "\x{fffd}\x{fffd}";

The perl snippet that produces this output looks like:

  my $check=qr/\A\Q$prefix\E(.)?/;

  $k=$prefix;
  if( ($rc=$cursor->c_get($k, $v, DB_SET_RANGE))==0 and $k=~$check ) {
    do {
      if( defined $1 ) {
        ...
      }
    } while( ($rc=$cursor->c_get($k, $v, DB_NEXT))==0 and $k=~$check );
  }

$prefix is initially "ä". So the store filter called from the first c_get sees this utf8 string and converts it correctly into a byte string.

Then the fetch filter should be passed the byte string "\303\244\303\244" but it gets the utf8 string "\x{e4}\x{e4}".

So, what is wrong here?

Why do I read a byte string from the database in one case (command line) and a character string in the other?

Thanks,
Torsten

Replies are listed 'Best First'.
Re: BerkeleyDB + UTF8 by ig (Vicar) on Mar 05, 2009 at 20:54 UTC
It seems that c_get saves the retrieved key into the passed SV without modifying its UTF8 flag, then calls the fetch filter, passing this SV. Thus, the UTF8 flag of $_ on entry to the fetch filter is whatever it was on $key in the call to c_get. You can see the effect in this modification of your test script. Note the difference when the UTF8 flag is turned off on $key before calling c_get in the last test.
Re^2: BerkeleyDB + UTF8 by tfoertsch (Beadle) on Mar 06, 2009 at 12:40 UTC
Thanks a lot. I have solved the problem by turning off the UTF8 bit on entry to the fetch filter. Now I know the solution is right.
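A minimal sketch of that workaround, assuming c_get behaves as ig describes. BerkeleyDB is not used here: `simulated_fetch_buffer` and `run_filter` are hypothetical helpers that stand in for the SV that c_get fills, and `Encode::_utf8_on` / `Encode::_utf8_off` are the flag-twiddling functions that Encode documents as internals, so treat this as an illustration rather than the code actually used in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the key buffer after c_get: the raw UTF-8
# octets of "ää" sit in an SV whose UTF8 flag is still on.
sub simulated_fetch_buffer {
    my $buf = "\303\244\303\244";
    Encode::_utf8_on($buf);   # assumption: flag inherited from the lookup key
    return $buf;
}

# Fetch filter as in the question: decodes $_ as it arrives.
my $naive_filter = sub { $_ = decode("utf8", $_) };

# Fetch filter with the workaround: clear the flag, then decode.
my $fixed_filter = sub {
    Encode::_utf8_off($_);
    $_ = decode("utf8", $_);
};

# Run a filter against a fresh simulated buffer, the way a DBM filter
# sees its data in $_.
sub run_filter {
    my ($filter) = @_;
    local $_ = simulated_fetch_buffer();
    $filter->();
    return $_;
}

# Format a string as a list of code points, avoiding wide-char output.
sub codepoints { join " ", map { sprintf "U+%04X", ord } split //, $_[0] }

my $naive = run_filter($naive_filter);
my $fixed = run_filter($fixed_filter);

print "naive: ", codepoints($naive), "\n";   # not the original key
print "fixed: ", codepoints($fixed), "\n";   # U+00E4 U+00E4
```

With the flag cleared, decode sees the four octets that are really in the buffer. Without it, the buffer is read as the two characters "\x{e4}\x{e4}", which is not valid UTF-8 once reduced to bytes.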
Re: BerkeleyDB + UTF8 by glasswalk3r (Friar) on Mar 05, 2009 at 17:28 UTC
When you say "When I use the same filters in my program I get this output:", what do you mean? What is this program of yours? It looks like the environment differs between running in the shell and running from your program (probably something like the locale). If you're using a GUI toolkit, maybe you should check its Unicode support. This article should also help you: http://perlgeek.de/en/article/encodings-and-unicode.

Alceu Rodrigues de Freitas Junior
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
Re: BerkeleyDB + UTF8 by ig (Vicar) on Mar 05, 2009 at 19:35 UTC
It seems the difference has to do with using c_get.