MrMadScience has asked for the wisdom of the Perl Monks concerning the following question:

I encountered an interesting situation in one of our hash uses. We are using BerkeleyDB::Hash to tie our hashes to files so we can access them "randomly" (over the web). However, some of the people that read these will, on occasion, forget to enter their query in the right case. I was looking at using Hash::Case::Preserve to tie the hash to allow for case-insensitive matching.
The best way I can do this so far is to do something like:
tie my %seqindex, "BerkeleyDB::Hash", -Filename => $globals{seq_xref}, -Flags => DB_RDONLY or goto bail; tie my %alt_seqindex, 'Hash::Case::Preserve'; foreach my $key (keys %seqindex){ $alt_seqindex{$key} = $seqindex{$key}; };
which means that the hash %alt_seqindex is the one I end up doing most of my searches on. It seems that building this second hash takes a very long time. I was wondering if there is a way to make that first tie statement use both the BerkeleyDB and Hash::Case::Preserve modules so that it ends up with a hash that is searchable in a case-insensitive manner?
Or if there are any other ways to do this quicker?
The hash value $globals{$seq_xref} refers to the name of the hash file that we're reading in.

Thank you

Replies are listed 'Best First'.
Re: Tie one hash two ways?
by Jenda (Abbot) on Oct 01, 2004 at 17:17 UTC

    Do you need to preserve the case of the keys? It might be easiest to convert all keys to lowercase in the hash and then lowercase the keys whenever you do any lookup.

    If you do have to keep the case you might create a lookup hash that would just map from the lowercased keys to the original ones and then do a two-step lookup:

    my %alt_seqindex; $alt_seqindex{lc $key} = $key while (my ($key) = each %seqindex);
    and then do the case insensitive lookup like this:
    $value = $seqindex{$alt_seqindex{lc $key}};
    If you do want to hide this from the users you may subclass BerkeleyDB::Hash, somewhat like this:
    package BerkeleyDB::Hash::CasePreserve; use BerkeleyDB; use base 'BerkeleyDB::Hash'; my %LCKeys; # to keep the lc $key => $key mapping for all hashes sub TIEHASH { my $obj = &BerkeleyDB::Hash::TIEHASH; my %lckeys; my $key = $obj->FIRSTKEY(); $lckeys{lc $key} = $key; $lckeys{lc $key} = $key while $key = $obj->NEXTKEY($key); $LCKeys{$obj->[0]} = \%lckeys; return $obj; } sub FETCH { my $self = shift; my $key = shift(); $key = $LCKeys{$self->[0]}->{lc $key} or return; return $self->SUPER::FETCH($key); } sub EXISTS { my $self = shift; my $key = shift(); $key = $LCKeys{$self->[0]}->{lc $key} or return; return $self->SUPER::EXISTS($key); } sub STORE { my $self = shift; my ($key, $value) = @_; if (exists $LCKeys{$self->[0]}->{lc $key}) { $key = $LCKeys{$self->[0]}->{lc $key}; } else { $LCKeys{$self->[0]}->{lc $key} = $key; } return $self->SUPER::STORE($key, $value); } sub DESTROY { my $self = shift(); delete $LCKeys{$self->[0]}; } 1;
    This implementation will be much quicker than Fletch's since it doesn't search through the whole tied hash each and every time.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

Re: Tie one hash two ways?
by Fletch (Bishop) on Oct 01, 2004 at 16:13 UTC

    Probably the simplest thing would be to subclass BerkeleyDB::Hash and override the FETCH() and STORE() methods something like (untested):

    package BerkeleyDB::Hash::CasePreserve; sub FETCH { my( $self, $key ) = @_; unless( $self->EXISTS( $key ) ) { my $key_re = qr/\A$key\z/i; my $check = $self->FIRSTKEY( ); do { if( $check =~ $key_re ) { $key = $check; last; } } while( $check = $self->NEXTKEY( $check ) ); } return $self->SUPER::FETCH( $key ); }

    Update: Added anchored $key_re.

Re: Tie one hash two ways?
by PodMaster (Abbot) on Oct 02, 2004 at 15:05 UTC
    which means that the hash %alt_seqindex is the one I end up doing most of my searches on. It seems that building this second hash takes a very long time
    Thats because Hash::Case::Preserve does everything in memory.

    Here's what I'd do (assumes your keys don't end in "_\0_key")

    package BerkeleyDB::Hash::CasePreserve; use BerkeleyDB; use base qw[ BerkeleyDB::Hash ]; sub FETCH { $_[0]->SUPER::FETCH( lc $_[1] ) } sub EXISTS { $_[0]->SUPER::EXISTS( lc $_[1] ) } sub NEXTKEY { my $next = ""; do { $next = $_[0]->SUPER::NEXT($_[1]); } until 0 < index $next, "_\0_key", - length "_\0_key"; return $_[0]->SUPER::FETCH( $next ); # oRiGiNal cAsE } sub STORE { $_[0]->SUPER::STORE( lc($_[1])."_\0_key", $_[1], ); # oRiGiNaL cAsE return $_[0]->SUPER::STORE( $_[1], $_[2] ); }
    This has the benefit of being self contained and memory efficient.

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.