Re: iteration through tied hash eat my memory

in reply to iteration through tied hash eat my memory

Update: The gist is that you are using the wrong tools in the wrong way. Your for(sort(keys())){->{$_)}} is wasteful, your use of Tie::RDBM is wasteful. Middle ground is finding a way to get your database to work incrementally instead of the all at once, universe smashing mode you've adopted.

You don't understand what your module is doing for you. I took a moment to read the code for the module and this is actually using the maximum amount of memory it can. Effectively - the module first does a select $self->{'key'}, $self->{'value'}, $self->{'frozen'} from $self->{'table'}. The first thing you should notice about that query is that it isn't in any particular order and that it loads the entire result into memory at once. Or actuallly... first the PostgreSQL backend server reads the entire thing from disk into memory, transfers it via shared memory to the Perl client which also has 100% of the table in memory and then your Tie::RDBM just iterates over the result set. (It may just load the entire key-set instead but that still may be non-trivial) If you've never actually bothered to read DBD::Pg you'll benefit from doing it now. The gist of it is that DBD::Pg doesn't have any way of only loading the data in bits as it arrives - it must allocate the whole shebang at once and work with things from there. It's unfortunate but it's reality. (you should also note that you're iterating the database more than once because keys() is one iteration. It's also a complete copy of all the keys so that's more memory. Then you go back and look up the entries by key value which requires more iteration...)

All this means is you're taking the wrong approach to your code and you made a faulty assumption. Oh well, so try again. I'd suggest one of two things. If you want to keep using PostgreSQL then move away from DBI and consider the plain Pg module. It's a wrapper to the plain PostgreSQL interface and from there you can access cursors and asynchronous queries. That might be a neat way for you to keep your current database and still get good performance.

Now here's how I would solve the problem. Dump PostgreSQL and move to BerkeleyDB. I'll let you know right now that there are two databases I just love to work with from perl - PostgreSQL and BerkeleyDB. PostgreSQL because it's a great RDBMS and has many of the features I want in database. BerkeleyDB because for some applications I can code something up that optimizes it's database access better than PostgreSQL ever seems to. Usually this is when I have to deal with lots of very simple data and I'm avoiding DBI::Pg because of no cursor/asynchronous query support (yes, perhaps going to Pg could fix that but I just haven't taken the time yet). BerkeleyDB is just the grown-up version of DB_File. Generally I use it in OO mode and don't bother with the tied interface. If it were more convenient to use the tied interface then I'd use that too - it's also good (and just ever so slightly slower).

I took a minute and started re-writing your code to use BerkeleyDB's Btree style database. An important feature of that database is that it's already sorted by key. If you iterate over it via a cursor then you get it in the order already. I had to stop with the code because you didn't include enough information to know how things are structured overall. Hopefully this post helps you get somewhere useful.

package Elric::Lexicon;

sub new {
  # so a lexicon is basically a hash (of entries)
  return bless {}, shift;
}

sub info {
  my $self     = shift;
  my $num_keys = 0;
  my $num_mu   = 0;
  my $key;
  my $value;
 
  &message("Buckets used/allocated:\t".scalar(%{$self})."\n");
  my $cursor = $self->db_cursor;
  while ($cursor->c_get( $key, $value, DB_NEXT) ) {
    $num_keys++;
    $num_mu += # Do something with $value
  }
}
 
sub bind {
  my $self = shift;
  my $type = shift;
  my $lang = shift;

  if ($type eq 'BerkeleyDB') {
    my $esd = \%Elric::System::default;

    # test if connectdb called before
    if (!defined $$esd{db}->{pass}) {
      message("error: can't bind... use connectdb first\n");
      return 0;
    }

    eval "use BerkeleyDB;" unless defined $INV{'BerkeleyDB.pm'};
    $self = BerkeleyDB::Btree->new
      ( -Filename => $$esd{filename},
        -Flags    => DB_CREATE )
      or die $BerkeleyDB::Error;

  return 1;
}


This code is called from the main program 

use Elric::Lexicon;
my %lexicon = ();
# ...
$lexicon{'EN'} = Elric::Lexicon->new;
$lexicon{'EN'}->bind('rdbm', 'EN');
# ...

my $mind_units = 0;

foreach (sort keys %lexicon) {            # iterate all present lexica
  &message("Lexicon $_:\t");              # name the current lexicon
  $mind_units += $lexicon{$_}->info();    # print information about it
+s internals
}
[download]

__SIG__
use B;
printf "You are here %08x\n", unpack "L!", unpack "P4", pack
  "L!", B::svref_2object(sub{})->OUTSIDE;
[download]

Comment on Re: iteration through tied hash eat my memory Select or Download Code

Replies are listed 'Best First'.
Re: Re: iteration through tied hash eat my memory by ph0enix (Friar) on Dec 09, 2002 at 15:42 UTC
Originally I started with BerkeleyDB but switch to the PostgreSQL because filesize limitation (can't create table larger than 2GB) and concurrent acces requirement. I want to continue using PostgreSQL and I also need tied solution with nested data structures support. So I slightly modify Tie::RDBM for binary data storage support... Is there better way how to use PostgreSQL with tie interface? I prefer to use existing module instead of writing the new one.	[reply]
Re^3: iteration through tied hash eat my memory by diotalevi (Canon) on Dec 09, 2002 at 15:50 UTC
Excuse me!? BerkeleyDB does databases up to 4 terabytes, transactions, concurrant access and cursors. Perhaps your operating system can't handle files that large but BerkeleyDB is perfectly fine with them. PostgreSQL avoids that OS limit by splitting database files at one gig. Which operating system are you using? If you want to continue to use PostgreSQL then you'll need to start accessing it smarter (cursors/asynchronous) or just waste minimal memory. In general a tied PostgreSQL interface really isn't the right solution for this (again, unless you do cursors or asynchronous). Really, do this The Right Way. Update: It also occurs to me that if your dataset that large it's already mandatory that you do things the smart way. Using cute gimicks is nice but you need to be intentional in how you approach your disk access and memory usage. You really can't afford not to. Update again: If you haven't already then you really need to read the BerkeleyDB document from SleepyCat. The Pod documentation in the CPAN module is really just a gloss on how to translate BerkeleyDB idioms into perlcode. You have to read the actual documentation to get at the right code. For instance - you only get concurrant access if you initialize that subsystem. The Pod documentation barely mentions it - it's fully covered in the library docs. So go read that. It's online at http://www.sleepycat.com/docs/index.html. You probably want to read the bit on the C API since that's where the CPAN module links in. `__SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;` [download]	[reply] [d/l]
Re^4: iteration through tied hash eat my memory by diotalevi (Canon) on Dec 09, 2002 at 19:20 UTC
I did more checking. This time on doing cursors in perl. The key here is to go to the PostgreSQL web site and search the archives for 'cursors perl'. That yields the example (from http://archives.postgresql.org/pgsql-general/2001-01/msg01569.php: `use Pg; my $conn = Pg::connectdb("dbname = test"); my $result = $conn->exec("begin work"); $result = $conn->exec("declare c1 cursor for select foo from bar"); $result = $conn->exec("fetch forward 1 in c1;"); print "Hurray, I fetched a row: ", $result->fetchrow, $/; $result = $conn->exec("end work;");` [download] `__SIG__ use B; printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;` [download]	[reply] [d/l] [select]
Re:^4 iteration through tied hash eat my memory by ph0enix (Friar) on Dec 09, 2002 at 16:40 UTC
I'm using SuSE Linux 7.3 (Intel 32-bit) which contains Berkeley DB database library in version 3.1.17. When file size with database raised to 2GB I get message like 'File size limit exceeded (SIGXFS2)' (not exactly - translated from localized message). I'm able to create larger files on filesystem (tested up to 6GB). Does it mean that db package in my distribution was miscompiled?	[reply]
Re:^5 iteration through tied hash eat my memory by diotalevi (Canon) on Dec 09, 2002 at 18:11 UTC
Re: Re: Re: iteration through tied hash eat my memory by iburrell (Chaplain) on Dec 09, 2002 at 20:05 UTC
You would need to modify the modules to be more efficient with memory usage with huge tables. You could either modify Tie::RDBM to use the Pg module for direct PostgreSQL access with cursors. Or modify DBD::Pg to use cursors. Otherwise, the SELECT statement to fetch the keys stores all the data on the client. You might want to investiage MySQL and see if DBD::Mysql uses cursors. It also looks like you will have to modify Tie::RDBM to not cache the values that it fetches. The important question to ask is whether you really need a tied hash. If you don't need one, then you can do operations more efficiently with custom database accesses. Since PostgreSQL doesn't use cursors, you will have to avoid any queries that return all the keys. For example, counting the number of keys is doable with a `SELECT COUNT() FROM table` statement. You might want to normalize the database to store each value in a row of a second table. Then you won't need to use Perl serialization to store the data structures. You can the total number of values with `SELECT COUNT() FROM values`. And individual counts for each key: `SELECT COUNT(*) FROM values WHERE key = $key`.	[reply] [d/l] [select]
Re^4: iteration through tied hash eat my memory by diotalevi (Canon) on Dec 09, 2002 at 20:20 UTC
Since PostgreSQL doesn't use cursors, you will have to avoid any queries that return all the keys. PostgreSQL does use cursors and so can perl: Re^4: iteration through tied hash eat my memory. Don't misinform.	[reply]

In Section Seekers of Perl Wisdom