Update: The gist is that you are using the wrong tools in the wrong way. Your for (sort keys %$h) { ... $h->{$_} ... } loop is wasteful, and so is your use of Tie::RDBM. The middle ground is finding a way to get your database to work incrementally instead of in the all-at-once, universe-smashing mode you've adopted.

You don't understand what your module is doing for you. I took a moment to read the code for the module, and it is actually using the maximum amount of memory it can. Effectively, the module first does a select $self->{'key'}, $self->{'value'}, $self->{'frozen'} from $self->{'table'}. The first things you should notice about that query are that it isn't in any particular order and that it loads the entire result into memory at once. Or actually... first the PostgreSQL backend server reads the entire thing from disk into memory, then transfers it to the Perl client, which now also has 100% of the table in memory, and only then does your Tie::RDBM iterate over the result set. (It may just load the entire key set instead, but that may still be non-trivial.) If you've never actually bothered to read DBD::Pg, you'll benefit from doing it now. The gist of it is that DBD::Pg doesn't have any way of loading the data in bits as it arrives: it must allocate the whole shebang at once and work with things from there. It's unfortunate, but it's reality.

You should also note that you're iterating over the database more than once. keys() is one full iteration, and it also builds a complete copy of all the keys, so that's more memory. Then you go back and look up each entry by key, which requires still more trips through the data.
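To make the keys()-versus-incremental difference concrete, here is a minimal sketch. It assumes nothing about Tie::RDBM's internals: CountingHash is a hypothetical, pure-Perl tied class invented for this illustration, and all it does is count how many keys Perl has asked the tie layer for. The point is only how Perl itself drives FIRSTKEY/NEXTKEY for sort keys compared with each.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a database-backed tie (NOT Tie::RDBM).
# It counts how many keys the tie layer has handed back to Perl.
package CountingHash;

our $key_calls = 0;

sub TIEHASH {
    my ($class, %data) = @_;
    return bless { data => {%data}, order => [], pos => 0 }, $class;
}

sub FETCH {
    my ($self, $k) = @_;
    return $self->{data}{$k};
}

sub FIRSTKEY {
    my ($self) = @_;
    $self->{order} = [ sort keys %{ $self->{data} } ];
    $self->{pos}   = 0;
    return $self->NEXTKEY;
}

sub NEXTKEY {
    my ($self) = @_;
    return undef if $self->{pos} >= @{ $self->{order} };
    $key_calls++;                       # one more key pulled from the "database"
    return $self->{order}[ $self->{pos}++ ];
}

package main;

tie my %h, 'CountingHash', (a => 1, b => 2, c => 3, d => 4, e => 5);

# Pattern 1: sort keys, then look each entry up again.
# Every key is pulled (and copied into a list) before the body runs once.
$CountingHash::key_calls = 0;
my $pulled_by_keys;
for my $k (sort keys %h) {
    $pulled_by_keys //= $CountingHash::key_calls;
    my $v = $h{$k};                     # a second trip per key
}

# Pattern 2: each. Only one key has been pulled when the body first runs.
$CountingHash::key_calls = 0;
my $pulled_by_each;
while (my ($k, $v) = each %h) {
    $pulled_by_each //= $CountingHash::key_calls;
}

print "keys pulled before first iteration (sort keys): $pulled_by_keys\n";
print "keys pulled before first iteration (each):      $pulled_by_each\n";
```

Note that each gives up the sorted order, and whether it actually saves memory depends on how the tied class implements FIRSTKEY and NEXTKEY; if the class slurps the whole table on FIRSTKEY, as described above, each alone will not rescue you.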

All this means is that you're taking the wrong approach to your code and you made a faulty assumption. Oh well, so try again. I'd suggest one of two things. If you want to keep using PostgreSQL, then move away from DBI and consider the plain Pg module. It's a wrapper around the plain PostgreSQL client interface (libpq), and from there you can access cursors and asynchronous queries. That might be a neat way for you to keep your current database and still get good performance.

Now here's how I would solve the problem: dump PostgreSQL and move to BerkeleyDB. I'll let you know right now that there are two databases I just love to work with from Perl: PostgreSQL and BerkeleyDB. PostgreSQL because it's a great RDBMS and has many of the features I want in a database. BerkeleyDB because for some applications I can code something up that optimizes its database access better than PostgreSQL ever seems to. Usually this is when I have to deal with lots of very simple data and I'm avoiding DBD::Pg because of its lack of cursor/asynchronous query support (yes, perhaps going to Pg could fix that, but I just haven't taken the time yet). BerkeleyDB is just the grown-up version of DB_File. Generally I use it in OO mode and don't bother with the tied interface. If it were more convenient to use the tied interface then I'd use that too; it's also good (and just ever so slightly slower).

I took a minute and started rewriting your code to use BerkeleyDB's Btree style of database. An important feature of that database is that it's already sorted by key, so if you iterate over it via a cursor you get the entries in key order without a separate sort. I had to stop with the code because you didn't include enough information to know how things are structured overall. Hopefully this post helps you get somewhere useful.

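package Elric::Lexicon;

use strict;
use warnings;
use BerkeleyDB;   # loaded at compile time so DB_CREATE and DB_NEXT resolve

sub new {
  # so a lexicon is basically a hash (of entries); once bind() has run,
  # the database handle lives in $self->{db}
  return bless {}, shift;
}

sub info {
  my $self     = shift;
  my $num_keys = 0;
  my $num_mu   = 0;
  my $key      = '';
  my $value    = '';

  # NB: this now describes the wrapper hash, not the database itself
  message("Buckets used/allocated:\t".scalar(%{$self})."\n");
  my $cursor = $self->{db}->db_cursor;
  # c_get returns 0 on success and a non-zero status when the keys run out
  while ($cursor->c_get( $key, $value, DB_NEXT ) == 0) {
    $num_keys++;
    # $num_mu += ...;   # Do something with $value (left unfinished)
  }
  $cursor->c_close;
  return $num_mu;       # the caller adds this to its running total
}

sub bind {
  my $self = shift;
  my $type = shift;
  my $lang = shift;

  if ($type eq 'BerkeleyDB') {
    my $esd = \%Elric::System::default;

    # test if connectdb called before
    if (!defined $$esd{db}->{pass}) {
      message("error: can't bind... use connectdb first\n");
      return 0;
    }

    # store the handle in the object; assigning to $self would only
    # replace the local variable and the caller would never see it
    $self->{db} = BerkeleyDB::Btree->new
      ( -Filename => $$esd{filename},
        -Flags    => DB_CREATE )
      or die $BerkeleyDB::Error;
  }

  return 1;
}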


This code is called from the main program:

use Elric::Lexicon;
my %lexicon = ();
# ...
$lexicon{'EN'} = Elric::Lexicon->new;
$lexicon{'EN'}->bind('BerkeleyDB', 'EN');
# ...

my $mind_units = 0;

foreach (sort keys %lexicon) {            # iterate all present lexica
  &message("Lexicon $_:\t");              # name the current lexicon
  $mind_units += $lexicon{$_}->info();    # print information about its internals
}
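For anyone who wants to try the Btree idea on its own, outside the Elric classes, the following is a minimal standalone sketch. The file name, the sample words and the values stored are all made up for illustration; the only real API used is the BerkeleyDB module's BerkeleyDB::Btree->new, db_put, db_cursor, c_get and c_close. It writes a few keys in arbitrary order and then walks them with a cursor, relying on the Btree's default byte-wise key ordering rather than on sort.

```perl
use strict;
use warnings;
use BerkeleyDB;
use File::Temp qw(tempdir);

# Hypothetical scratch location; a real lexicon would use a fixed path.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/lexicon.db";

my $db = BerkeleyDB::Btree->new(
    -Filename => $file,
    -Flags    => DB_CREATE,
) or die "Cannot open $file: $! $BerkeleyDB::Error\n";

# Insert in arbitrary order; the Btree keeps the keys sorted on disk.
for my $word (qw(pear apple zebra mango)) {
    $db->db_put($word, length $word) == 0
        or die "db_put failed: $BerkeleyDB::Error\n";
}

# Walk the database one record at a time. Only the current key/value
# pair is held in Perl; there is no keys() list and no sort().
my @seen;
my ($key, $value) = ('', '');
my $cursor = $db->db_cursor;
while ($cursor->c_get($key, $value, DB_NEXT) == 0) {
    push @seen, $key;
    print "$key => $value\n";
}
$cursor->c_close;

# Release the handles before the temporary directory is cleaned up.
undef $cursor;
undef $db;
```

The loop condition compares against 0 because c_get returns a status code, with 0 meaning success, so a bare while ($cursor->c_get(...)) would exit immediately on the first successful read.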

In reply to Re: iteration through tied hash eat my memory by diotalevi
in thread iteration through tied hash eat my memory by ph0enix
