metaperl has asked for the wisdom of the Perl Monks concerning the following question:

I am using Cygwin Perl on Windows XP and am wondering what my options are for handling this problem.

Basically, I read the contents of a very large file into an array and then bucket-hash it using the Data::Bucket index() method:

=head2 index

 Usage     : my $bucket = Data::Bucket->index(data => \@strings, %other);
 Purpose   : Build a data structure with @strings partitioned into buckets
 Returns   : An object.
 Argument  : A list of data compatible with the compute_index() function

=cut

sub index {
    my ($class, %parm) = @_;

    exists $parm{data}         or die "Data must be passed for indexing";
    ref $parm{data} eq 'ARRAY' or die 'You must pass an array ref';

    my $self = bless \%parm, ref($class) || $class;

    $self->bucket_hash;

    return $self;
}

=head2 bucket_hash

 Usage     : Called internally by index()
 Purpose   : Partition $self->{data} by repeated calls to
             $self->compute_record_index
 Returns   : Nothing
 Argument  : None.

=cut

sub bucket_hash {
    my ($self) = @_;

    for my $data (@{ $self->{data} }) {
        my $index = $self->compute_record_index($data);
        my @index = ref $index eq 'ARRAY' ? @$index : ($index);
        for (@index) {
            push @{ $self->{bucket}{$_} }, $data;
        }
    }

    return $self;
}
However, the call to index() resulted in an out-of-memory error, and I am considering various approaches to fixing it:
  1. Rewrite the bucket_hash method used by index() so that it writes to a SQLite database. This approach allows for an in-memory or on-disk database as needed. I'm still kicking myself for not doing this in the first place, but I figured I would never run out of memory. (A rough sketch of what I have in mind follows this list.)
  2. Tie the Perl hash to disk somehow. I think there are modules for saving hashes to Sleepycat (Berkeley DB) files or something... any ideas here? This would save me from rewriting my code.
  3. Increase the virtual memory on my machine... perhaps I can just jack up the setting, but would Cygwin Perl know how to take advantage of that memory, or can my hashref only occupy main physical RAM?
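
For option 1, here is roughly the shape I have in mind. This is only a sketch, not working Data::Bucket code, and the table, column, and file names are invented:

    use DBI;

    # on-disk database; dbname=:memory: would keep it in RAM instead
    my $dbh = DBI->connect("dbi:SQLite:dbname=buckets.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do("CREATE TABLE IF NOT EXISTS bucket (idx TEXT, record TEXT)");
    $dbh->do("CREATE INDEX IF NOT EXISTS bucket_idx ON bucket (idx)");

    my $insert = $dbh->prepare("INSERT INTO bucket (idx, record) VALUES (?, ?)");

    # where bucket_hash() now does push @{ $self->{bucket}{$_} }, $data,
    # it would instead call $insert->execute($_, $data);

    $dbh->commit;

    # later, pull back a single bucket without loading the rest:
    my $some_index = 'c';    # whatever compute_record_index() returns
    my $records    = $dbh->selectcol_arrayref(
        "SELECT record FROM bucket WHERE idx = ?", undef, $some_index);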

Re: hashref population yields out of memory error
by kyle (Abbot) on Dec 28, 2007 at 16:11 UTC

    If you want to tie a hash to disk, DBM::Deep is a common way to go because it can handle nested structures. If you don't need that, you can use dbm modules that come with Perl.
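
    A minimal sketch of the DBM::Deep route (untested here; the file name and keys are arbitrary):

        use DBM::Deep;

        my $db = DBM::Deep->new( 'bucket.db' );    # everything lives in this file

        my ($index, $record) = ( 'c', 'cat camel cow' );    # stand-ins

        # nested structures are written through to the file as you build them
        $db->{bucket}{$index} = [] unless exists $db->{bucket}{$index};
        push @{ $db->{bucket}{$index} }, $record;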

        What's happened is that DBM::Deep has grown beyond my ability to maintain it by myself. Specifically, most of the 9 open bugs are win32/cygwin specific. I have almost zero experience developing on those platforms and am struggling. I would be absolutely delighted to take patches and would love to open up the SVN tree to other developers. Want to help? Please??

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: hashref population yields out of memory error
by sundialsvc4 (Abbot) on Dec 28, 2007 at 19:49 UTC

    You probably need to re-think your entire algorithm.

    While it is very tempting to “stuff it all into a hashref and get it back with random-access,” this is not a good approach to take when faced with very large amounts of data.

    “Memory,” after all, is virtual, and therefore backed by a disk file. As you seek through it randomly, page faults occur and the system can slow down precipitously.

    A much better approach when faced with large amounts of data is to employ a disk-based sort. Yes, I am talking about sequential files! When two files are being compared and you know that both of them are identically sorted, the process becomes very fast. Furthermore, sorting is one of those algorithms that is “unexpectedly fast and efficient,” so total run times, even counting the two sorts, can be markedly less than you might imagine. (Think of run times dropping from “several hours” to “minutes,” or maybe even seconds.)
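
    In Perl, the matching pass over two already-sorted files is about this simple (a rough sketch: the file names are made up, and both files are assumed to be sorted on the same whole-line key):

        open my $left,  '<', 'left.sorted'  or die "left: $!";
        open my $right, '<', 'right.sorted' or die "right: $!";

        my $l = <$left>;
        my $r = <$right>;

        # classic merge-compare: each file is read once, front to back
        while (defined $l and defined $r) {
            if    ($l lt $r) { $l = <$left>  }                           # only in left
            elsif ($l gt $r) { $r = <$right> }                           # only in right
            else             { print $l; $l = <$left>; $r = <$right> }   # in both
        }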

    This is how data was processed, using punched cards, long before digital computers were invented. It's what they were doing with their computers in all those sci-fi movies from the 1960s, with all those tapes spinning merrily along and ... you may have noticed ... never going backwards. (The technique those machines were running was called a “tape sort” or “polyphase sort,” and it still works.)

Re: hashref population yields out of memory error
by metaperl (Curate) on Dec 28, 2007 at 17:41 UTC
    What's odd is that I rewrote Data::Bucket to use MLDBM but memory is still being exhausted:
    sub bucket_hash {
        my ($self) = @_;

        if ($self->{on_disk}) {
            defined $self->{dir} and $self->{dir} = "$self->{dir}/";
            my $outfile = sprintf "%s%s", $self->{dir}, ($self->{file} || "deep.db");

            # needs: use MLDBM; use Fcntl;
            my %o;
            my $dbm = tie %o, 'MLDBM', $outfile, O_CREAT|O_RDWR, 0640 or die $!;
            $self->{bucket} = \%o;
        }

        for my $data (@{ $self->{data} }) {
            my $index = $self->compute_record_index($data);
            my @index = ref $index eq 'ARRAY' ? @$index : ($index);

            for (@index) {
                # MLDBM cannot update a nested structure in place, so
                # fetch the bucket, push onto it, and store it back.
                my $tmp = exists $self->{bucket}{$_} ? $self->{bucket}{$_} : [];
                push @$tmp, $data;
                $self->{bucket}{$_} = $tmp;
                # push @{ $self->{bucket}{$_} }, $data;
            }
        }

        return $self;
    }
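
    For reference, here is the standalone shape of the MLDBM tie as I understand it. This is a sketch only; the backing DBM, serializer, and file name are chosen just for illustration:

        use Fcntl;                         # O_CREAT, O_RDWR
        use MLDBM qw(DB_File Storable);    # backing DBM and serializer

        tie my %bucket, 'MLDBM', 'bucket.db', O_CREAT|O_RDWR, 0640
            or die "tie failed: $!";

        # MLDBM serializes the whole value on every store, so nested
        # structures must be fetched, modified, and stored back:
        my ($index, $record) = ( 'c', 'cat camel cow' );    # stand-ins
        my $tmp = $bucket{$index} || [];
        push @$tmp, $record;
        $bucket{$index} = $tmp;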