MLDBM for hash > 2GiB?

by gregor-e (Beadle)
on Nov 29, 2005 at 15:44 UTC ( [id://512676] )

gregor-e has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse some text files that are generated each month and keep a summary in a persistent multi-level hash. These text files contain around 5 GiB of information, resulting in a tied-hash .db file that is expected to be around 12 GiB once all months are loaded. Trouble is, as the MLDBM .db file rounds the 2 GiB mark, I'm getting:
lseek error at /usr/lib/perl5/site_perl/5.8.5/MLDBM.pm line 161, <IN_FILE> line 5038240.
At this point the MLDBM .db file looks like:
-rw-r--r-- 1 gomer user 2147479776 Nov 28 17:28 summaryDatabase.db
In this case, MLDBM was used like:
use MLDBM qw(GDBM_File Data::Dumper);
But I have also tried Storable as the serializer, as well as the default SDBM_File TIEHASH. Every configuration so far has bombed as soon as the resulting .db file crosses the 2 GiB mark. Is there some combination of TIEHASH module and serializer that lets one keep a persistent multi-level hash larger than 2 GiB? (Please don't tell me I should just use DBI. In this situation, that means petitioning for an Oracle installation.)
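
For context, the tie itself is the stock MLDBM recipe. A minimal sketch of the setup (the month key and counter names below are invented for illustration; the real data is messier):

use strict;
use warnings;
use GDBM_File;                        # for the GDBM_WRCREAT flag
use MLDBM qw(GDBM_File Data::Dumper);

tie my %summary, 'MLDBM', 'summaryDatabase.db', &GDBM_WRCREAT, 0640
    or die "Cannot tie summaryDatabase.db: $!";

# MLDBM only sees top-level fetches and stores, so nested updates go
# through a temporary copy and the whole record is written back.
my $month = $summary{'2005-11'} || {};
$month->{widgets}{count} += 42;
$summary{'2005-11'} = $month;

untie %summary;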

The underlying Fedora Core 3 & Perl versions:

Linux rskass_arc_2 2.6.12-1.1381_FC3smp #1 SMP Fri Oct 21 04:03:26 EDT 2005 i686 i686 i386 GNU/Linux
This is perl, v5.8.5 built for i386-linux-thread-multi

Replies are listed 'Best First'.
Re: MLDBM for hash > 2GiB?
by merlyn (Sage) on Nov 29, 2005 at 15:47 UTC
Re: MLDBM for hash > 2GiB?
by perrin (Chancellor) on Nov 29, 2005 at 17:04 UTC
    You should switch from GDBM_File to BerkeleyDB. It supports file sizes of multiple terabytes. Also, use Storable, since it's faster and more compact. There is no need to use a relational database for this -- you just need a more scalable dbm.
      I installed BerkeleyDB from CPAN, changed the use to:
      use MLDBM qw(BerkeleyDB Storable);
      and gave it a kick. It moaned:
      TIEHASH is not a valid BerkeleyDB macro at /usr/lib/perl5/site_perl/5.8.5/MLDBM.pm line 143
      Is there a way to persuade MLDBM to use BerkeleyDB? (Okay, I've only scanned the friendly manpage for BerkeleyDB briefly. I'll go back and give it a good squint.)
        Try this:

        use MLDBM qw(BerkeleyDB::Btree Storable);

        Also, there is a section on MLDBM in the BerkeleyDB docs.
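
        A minimal, untested sketch of that combination, in case it helps (the filename is invented; DB_CREATE is exported by the BerkeleyDB module):

        use strict;
        use warnings;
        use BerkeleyDB;                            # exports DB_CREATE
        use MLDBM qw(BerkeleyDB::Btree Storable);

        tie my %summary, 'MLDBM',
            -Filename => 'summaryDatabase.db',     # invented name
            -Flags    => DB_CREATE
            or die "Cannot tie summaryDatabase.db: $!";

        # Same caveat as with any MLDBM store: fetch a record, modify
        # the copy, then assign it back so it gets re-serialized.
        my $rec = $summary{'2005-11'} || {};
        $rec->{total}++;
        $summary{'2005-11'} = $rec;

        untie %summary;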

Re: MLDBM for hash > 2GiB?
by jfroebe (Parson) on Nov 29, 2005 at 16:16 UTC

    Hi,

    There are several free DBMSs that you may want to look into:

      Embedded DBMSs
    1. SQLite
    2. xBase

      Standalone DBMS servers
    1. MySQL
    2. Postgres
    3. Firebird

    Each of these is a fine solution and each has a DBI driver. I would have to agree with Randal Schwartz on SQLite, since we are talking about a small database and no concurrent usage by other processes.
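
    A rough sketch of what the DBD::SQLite route could look like (the database file, table, and columns here are invented for illustration; the real schema would mirror your summary structure):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=summary.sqlite', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # One row per (month, item) pair instead of a nested hash.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS summary (
            month TEXT    NOT NULL,
            item  TEXT    NOT NULL,
            count INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (month, item)
        )
    });

    my $sth = $dbh->prepare(
        'INSERT OR REPLACE INTO summary (month, item, count) VALUES (?, ?, ?)');
    $sth->execute('2005-11', 'widgets', 42);

    $dbh->commit;
    $dbh->disconnect;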

    Jason L. Froebe

    Team Sybase member

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

Re: MLDBM for hash > 2GiB?
by wazzuteke (Hermit) on Nov 29, 2005 at 16:11 UTC
    I don't believe I've ever seen this 2 GB barrier myself, unless the file system won't let you write files larger than 2 GB -- in those cases I have seen big problems. Unfortunately, the only way to combat that is to rebuild the OS, which is rarely a viable solution.

    That aside, I also noticed that the MLDBM perldoc notes in its WARNINGS section that:

    Many DBM implementations have arbitrary limits on the size of records that can be stored. For example, SDBM and many ODBM or NDBM implementations have a default limit of 1024 bytes for the size of a record. MLDBM can easily exceed these limits when storing large data structures, leading to mysterious failures. Although SDBM_File is used by MLDBM by default, it is not a good choice if you're storing large data structures. Berkeley DB and GDBM both do not have these limits, so I recommend using either of those instead.

    Reading this, I would make sure not to use SDBM_File, ODBM, or NDBM if your structures are larger than 1024 bytes. You can always test this with Storable::freeze().
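
    For example, a quick check of how big each top-level record is once frozen (the %summary structure here is a made-up stand-in):

    use strict;
    use warnings;
    use Storable qw(freeze);

    # Stand-in for whatever multi-level structure you intend to store.
    my %summary = ( '2005-11' => { widgets => { count => 42 } } );

    for my $key (sort keys %summary) {
        my $bytes = length freeze($summary{$key});
        print "$key freezes to $bytes bytes\n";
        warn "$key would exceed a 1024-byte DBM record limit\n"
            if $bytes > 1024;
    }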

    If that doesn't help, I would recommend posting a snippet of the code you are using to tie your hash to the file and write the data in the first place. There may be ways of refactoring it that would be beneficial. Worth a shot, I suppose.

    ---hA||ta----
    print map{$_.' '}grep{/\w+/}@{[reverse(qw{Perl Code})]} or die while ( 'trying' );
Re: MLDBM for hash > 2GiB?
by creamygoodness (Curate) on Nov 29, 2005 at 16:54 UTC
    I'd be curious to see one part of the output of perl -V. The USE_LARGE_FILES option has been enabled by default since 5.6. You can confirm that it's there by looking for a line like this...

    Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
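
    Alternatively, the Config module reports the same setting; this one-liner should print "define" when large-file support was compiled in:

    perl -MConfig -le 'print $Config{uselargefiles}'
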
    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      Yes, USE_LARGE_FILES is enabled:
      Characteristics of this binary (from libperl):
      Compile-time options: DEBUGGING MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
Re: MLDBM for hash > 2GiB?
by gregor-e (Beadle) on Nov 30, 2005 at 15:48 UTC
    Well, perrin did get me over the 2 GiB hump with the suggestion to use BerkeleyDB. Now, unfortunately, it complains "Out of memory!" after building about 20 GiB of .db file (much larger than I originally anticipated).

    So I suspect I'd best go with merlyn's suggestion of using DBD::SQLite, even if SQL does make me itch. Thanks for all your suggestions.
