metaperl has asked for the wisdom of the Perl Monks concerning the following question:

I am using Cygwin Perl on Windows XP and am wondering what my options are for handling this problem.

Basically, I read the contents of a very large file into an array and then bucket-hash it using the Data::Bucket index() method:

=head2 index

 Usage     : my $bucket = Data::Bucket->index(data => \@strings, %other);
 Purpose   : Build a data structure with @strings partitioned into buckets
 Returns   : An object.
 Argument  : A list of data compatible with the compute_index() function

=cut

sub index {
    my ($class, %parm) = @_;

    exists $parm{data}         or die "Data must be passed for indexing";
    ref $parm{data} eq 'ARRAY' or die 'You must pass an array ref';

    my $self = bless \%parm, ref($class) || $class;

    $self->bucket_hash;

    return $self;
}

=head2 bucket_hash

 Usage     : Called internally by index()
 Purpose   : Partition $self->{data} by repeated calls to
             $self->compute_record_index
 Returns   : Nothing
 Argument  : None.

=cut

sub bucket_hash {
    my ($self) = @_;

    for my $data (@{ $self->{data} }) {
        my $index = $self->compute_record_index($data);
        my @index = ref $index eq 'ARRAY' ? @$index : ($index);
        for (@index) {
            push @{ $self->{bucket}{$_} }, $data;
        }
    }

    return $self;
}
However, the call to index() resulted in an out-of-memory error, and I am considering various approaches to fixing it:
  1. Rewrite the bucket_hash method used by index() so that it writes to a SQLite database. This approach allows for an in-memory or on-disk database as needed. I'm still kicking myself for not doing this in the first place, but I figured I would never run out of memory. (A rough sketch of what I have in mind follows this list.)
  2. Tie the Perl hash to disk somehow. I think there are modules for saving hashes to Sleepycat (Berkeley DB) files or something... any ideas here? This would save me from rewriting my code.
  3. Increase the virtual memory on my machine... perhaps I can just jack up the setting, but would Cygwin Perl know how to take advantage of that memory, or can my hashref only occupy main physical RAM?
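
For option 1, here is roughly the shape I have in mind. This is only a sketch, not working Data::Bucket code, and the table, column, and file names are invented:

    use DBI;

    # on-disk database; dbname=:memory: would keep it in RAM instead
    my $dbh = DBI->connect("dbi:SQLite:dbname=buckets.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do("CREATE TABLE IF NOT EXISTS bucket (idx TEXT, record TEXT)");
    $dbh->do("CREATE INDEX IF NOT EXISTS bucket_idx ON bucket (idx)");

    my $insert = $dbh->prepare("INSERT INTO bucket (idx, record) VALUES (?, ?)");

    # where bucket_hash() now does push @{ $self->{bucket}{$_} }, $data,
    # it would instead call $insert->execute($_, $data);

    $dbh->commit;

    # later, pull back a single bucket without loading the rest:
    my $some_index = 'c';    # whatever compute_record_index() returns
    my $records    = $dbh->selectcol_arrayref(
        "SELECT record FROM bucket WHERE idx = ?", undef, $some_index);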

Re: hashref population yields out of memory error
by kyle (Abbot) on Dec 28, 2007 at 16:11 UTC

    If you want to tie a hash to disk, DBM::Deep is a common way to go because it can handle nested structures. If you don't need that, you can use dbm modules that come with Perl.
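
    A minimal sketch of the DBM::Deep route (untested here; the file name and keys are arbitrary):

        use DBM::Deep;

        my $db = DBM::Deep->new( 'bucket.db' );    # everything lives in this file

        my ($index, $record) = ( 'c', 'cat camel cow' );    # stand-ins

        # nested structures are written through to the file as you build them
        $db->{bucket}{$index} = [] unless exists $db->{bucket}{$index};
        push @{ $db->{bucket}{$index} }, $record;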

        What's happened is that DBM::Deep has grown beyond my ability to maintain it by myself. Specifically, most of the 9 open bugs are win32/cygwin specific. I have almost zero experience developing on those platforms and am struggling. I would be absolutely delighted to take patches and would love to open up the SVN tree to other developers. Want to help? Please??

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: hashref population yields out of memory error
by sundialsvc4 (Abbot) on Dec 28, 2007 at 19:49 UTC

    You probably need to re-think your entire algorithm.

    While it is very tempting to “stuff it all into a hashref and get it back with random-access,” this is not a good approach to take when faced with very large amounts of data.

    “Memory,” after all, is virtual, and therefore backed by a disk file. As you seek through it randomly, page faults occur and the system can slow down precipitously.

    A much better approach when faced with large amounts of data is to employ a disk-based sort. Yes, I am talking about sequential files! When two files are being compared and you know that both of them are identically sorted, the process becomes very fast. Furthermore, sorting is one of those algorithms that is “unexpectedly fast and efficient,” so total run times, even counting the two sorts, can be markedly less than you might imagine. (Think of run times dropping from “several hours” to “minutes,” or maybe even seconds.)
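
    In Perl, the matching pass over two already-sorted files is about this simple (a rough sketch: the file names are made up, and both files are assumed to be sorted on the same whole-line key):

        open my $left,  '<', 'left.sorted'  or die "left: $!";
        open my $right, '<', 'right.sorted' or die "right: $!";

        my $l = <$left>;
        my $r = <$right>;

        # classic merge-compare: each file is read once, front to back
        while (defined $l and defined $r) {
            if    ($l lt $r) { $l = <$left>  }                           # only in left
            elsif ($l gt $r) { $r = <$right> }                           # only in right
            else             { print $l; $l = <$left>; $r = <$right> }   # in both
        }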

    This is how data was processed, using punched cards, long before digital computers were invented. It's what they were doing with their computers in all those sci-fi movies from the 1960s, with all those tapes spinning merrily along and ... you may have noticed ... never going backwards. (The technique those machines were running was called a “tape sort” or “polyphase sort,” and it still works.)

Re: hashref population yields out of memory error
by metaperl (Curate) on Dec 28, 2007 at 17:41 UTC
    What's odd is that I rewrote Data::Bucket to use MLDBM but memory is still being exhausted:
    sub bucket_hash {
        my ($self) = @_;

        if ($self->{on_disk}) {
            defined $self->{dir} and $self->{dir} = "$self->{dir}/";
            my $outfile = sprintf "%s%s", $self->{dir}, ($self->{file} || "deep.db");

            # needs: use MLDBM; use Fcntl;
            my %o;
            my $dbm = tie %o, 'MLDBM', $outfile, O_CREAT|O_RDWR, 0640 or die $!;
            $self->{bucket} = \%o;
        }

        for my $data (@{ $self->{data} }) {
            my $index = $self->compute_record_index($data);
            my @index = ref $index eq 'ARRAY' ? @$index : ($index);

            for (@index) {
                # MLDBM cannot update a nested structure in place, so
                # fetch the bucket, push onto it, and store it back.
                my $tmp = exists $self->{bucket}{$_} ? $self->{bucket}{$_} : [];
                push @$tmp, $data;
                $self->{bucket}{$_} = $tmp;
                # push @{ $self->{bucket}{$_} }, $data;
            }
        }

        return $self;
    }
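
    For reference, here is the standalone shape of the MLDBM tie as I understand it. This is a sketch only; the backing DBM, serializer, and file name are chosen just for illustration:

        use Fcntl;                         # O_CREAT, O_RDWR
        use MLDBM qw(DB_File Storable);    # backing DBM and serializer

        tie my %bucket, 'MLDBM', 'bucket.db', O_CREAT|O_RDWR, 0640
            or die "tie failed: $!";

        # MLDBM serializes the whole value on every store, so nested
        # structures must be fetched, modified, and stored back:
        my ($index, $record) = ( 'c', 'cat camel cow' );    # stand-ins
        my $tmp = $bucket{$index} || [];
        push @$tmp, $record;
        $bucket{$index} = $tmp;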