musterion has asked for the wisdom of the Perl Monks concerning the following question:

GDBM_File does not appear to be friendly to utf8. How does one conjure GDBM_File to accept utf8 strings? Example:
    require 5.8.5;
    use strict;
    no strict 'subs';
    use warnings;
    use encoding 'utf8';
    use Carp;
    use English;
    use GDBM_File;

    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';
    binmode STDERR, ':utf8';

    my $count = 0;
    my %keys;
    my $file = "FAST.gdbm";
    tie (%keys, GDBM_File, $file, &GDBM_WRCREAT, 0644)
        || die ("could not open:$file");

    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $heading) = split (/\t/, $line);
        eval { $keys{$key} = $heading; };
        if ("" ne $EVAL_ERROR) {
            print $line, "\n";
        }
        if (0 == (++$count % 10000)) {
            print STDERR "$count loaded\n";
        }
    }
    print STDERR "$count loaded\n";
Example input:
    fst01710268	$aCinéma vérité films
    fst01710335	$aSchulmädchen-Report films
    fst01710349	$aAngélique films
    fst01710442	$aTrapalhões films
    fst01726204	$aFantômas films
    fst01726458	$aFlügelhorn music (Jazz)
    fst01726727	$aRomans à clef

Replies are listed 'Best First'.
Re: utf8 and GDBM
by ikegami (Patriarch) on Jun 09, 2010 at 21:39 UTC

    If GDBM expects bytes, you'll need to serialise your text into bytes. That specific type of serialisation is called encoding. The following encodes the text using UTF-8.

    Based on what Khen1950fx posted,

    #!/usr/bin/env perl
    use v5.8.5;  # Why?
    use strict;
    use warnings;
    use utf8;
    use open ':std', ':utf8';
    use English;
    use GDBM_File qw( GDBM_WRCREAT );

    # Encode a text string into UTF-8 bytes.
    sub _e { my $s = shift; utf8::encode($s); $s }

    my $file = "FAST.gdbm";
    tie (my %keys, 'GDBM_File', $file, GDBM_WRCREAT, 0644)
        or die ("Could not open \"$file\": $!\n");

    my $count = 0;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $heading) = split (/\t/, $line);
        # Store bytes, not characters; the trailing 1 makes eval return true on success.
        if (!eval { $keys{_e($key)} = _e($heading); 1 }) {
            warn "Can't store record \"$line\": $@";
            next;
        }
        if (0 == (++$count % 10000)) {
            print "$count loaded\n";
        }
    }
    print "$count loaded\n";
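
Reading the values back needs the reverse step, since what comes out of the DBM file is bytes again. Below is a minimal sketch of the round trip. It reuses the `_e` helper from the code above; `_d` is a hypothetical counterpart that is not from the thread, and a plain hash stands in for the tied GDBM_File hash so the snippet runs without a database file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Text -> UTF-8 bytes (same helper as in the reply above).
sub _e { my $s = shift; utf8::encode($s); $s }

# UTF-8 bytes -> text. Hypothetical counterpart, not from the thread.
# utf8::decode returns false when the bytes are not well-formed UTF-8.
sub _d {
    my $s = shift;
    utf8::decode($s) or die "Value is not valid UTF-8\n";
    return $s;
}

# Assumption: a plain hash stands in for the tied GDBM_File hash.
my %store;

my $key     = 'fst01726204';
my $heading = "\$aFant\x{F4}mas films";    # \x{F4} is o with circumflex

# Store: encode both key and value so the hash only ever sees bytes.
$store{ _e($key) } = _e($heading);

# Fetch: look up with the encoded key, then decode the value.
my $raw  = $store{ _e($key) };
my $text = _d($raw);

binmode STDOUT, ':encoding(UTF-8)';
printf "raw length %d, text length %d, round trip %s\n",
    length($raw), length($text), ($text eq $heading ? 'ok' : 'FAILED');
```

With the real tied hash, the same `_d` call would wrap each fetched key and value. The DBM filter hooks described in perldbmfilter (`filter_store_key`, `filter_store_value`, `filter_fetch_key`, `filter_fetch_value`) may be a tidier way to set this up once per tie, though that is untested against this data.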
Re: utf8 and GDBM
by Khen1950fx (Canon) on Jun 09, 2010 at 20:51 UTC
    Here's your code with tags:
    #!/usr/bin/perl
    require 5.8.5;
    use strict;
    no strict 'subs';
    use warnings;
    use encoding 'utf8';
    use Carp;
    use English;
    use GDBM_File;

    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';
    binmode STDERR, ':utf8';

    my $count = 0;
    my %keys;
    my $file = "FAST.gdbm";
    tie (%keys, GDBM_File, $file, &GDBM_WRCREAT, 0644)
        || die ("could not open:$file");

    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $heading) = split (/\t/, $line);
        eval { $keys{$key} = $heading; };
        if ("" ne $EVAL_ERROR) {
            print $line, "\n";
        }
        if (0 == (++$count % 10000)) {
            print STDERR "$count loaded\n";
        }
    }
    print STDERR "$count loaded\n";