artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I would like to avoid an 'out of memory' error in my program, which is caused by the large amount of data held in its data structures. I have to compare texts for equality, and I am storing them as hash keys. The keys are texts of approximately 1000 characters each, I have about 500 such entries, and I have to find the duplicate entries.

Which technique can help me avoid the 'out of memory' error?

Thanks,
Artist

Replies are listed 'Best First'.
Re: Memory Restrictions
by graff (Chancellor) on Oct 24, 2002 at 03:05 UTC
    Depending on the relative "confusability" of the texts, you probably don't need to load the entire 1000 or so characters per text in order to establish uniqueness.

    Try working out an algorithm where you start by comparing, say, the first 80 characters of each text, set aside the ones whose prefixes are unique, and do pair-wise comparisons only on the texts whose prefixes collide.
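
    A rough sketch of that two-pass idea (it assumes the texts live in files named on the command line, and the 80-character cutoff is arbitrary):

    use strict;
    use warnings;

    # First pass: keep only an 80-character prefix per text, so the full
    # texts never have to sit in memory at the same time.
    my %by_prefix;
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        read $fh, my $prefix, 80;
        close $fh;
        push @{ $by_prefix{$prefix} }, $file;
    }

    # Second pass: only texts whose prefixes collide need a full comparison.
    for my $group (grep { @$_ > 1 } values %by_prefix) {
        my %full;
        for my $file (@$group) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            my $text = do { local $/; <$fh> };   # slurp the whole text
            close $fh;
            push @{ $full{$text} }, $file;
        }
        print "identical texts: @$_\n" for grep { @$_ > 1 } values %full;
    }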

    Another possibility, if "equality" is supposed to mean "byte level identity" among the texts: read each one, one at a time, and generate an MD5 signature or checksum for it. Now just compare the checksum signatures -- in fact, just sort them and look for consecutive values that are identical; those will be identical texts.

    update: If the texts vary in size, even a little bit, just comparing sizes might be a good thing to start with. If you want to ignore differences in white-space characters (like "diff -b"), read the texts one at a time and normalize white-space before noting their lengths (and generating checksums).
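
    A minimal sketch of the checksum idea, again assuming one text per file; the white-space normalization is optional and mimics "diff -b" (Digest::MD5 is on CPAN and ships with Perl 5.8):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;    # digest => first file seen with that content
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $text = do { local $/; <$fh> };   # slurp the whole text
        close $fh;

        $text =~ s/\s+/ /g;                  # normalize white-space, like "diff -b"
        my $digest = md5_hex($text);

        if (exists $seen{$digest}) {
            print "$file duplicates $seen{$digest}\n";
        } else {
            $seen{$digest} = $file;
        }
    }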

      I agree, MD5 is totally the way to go for this. Simple and efficient.
Re: Memory Restrictions
by atcroft (Abbot) on Oct 24, 2002 at 02:28 UTC

    You may wish to consider a tied data structure. Two tutorials are available on the site, The Tie()s That Bind and Tie: Creating Special Objects, that might be helpful. My limited understanding is that you can tie() a data structure such as an array or a hash to a file, so the data is maintained in the file rather than in memory. A little slower, perhaps, but it would ease your memory requirements (although if you are low on disk space as well, you might encounter problems).
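
    For what it's worth, a minimal sketch of that idea, assuming DB_File (with Berkeley DB) is installed and that the texts arrive one per line on standard input -- both of those are just assumptions for the example:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Tie %seen to a file on disk so the hash data lives there, not in RAM.
    my %seen;
    tie %seen, 'DB_File', 'seen_texts.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie seen_texts.db: $!";

    while (my $text = <STDIN>) {
        chomp $text;
        if (exists $seen{$text}) {
            print "duplicate: $text\n";
        } else {
            $seen{$text} = 1;
        }
    }

    untie %seen;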

    Hope that helps, and I look forward to the suggestions of others that might also be helpful.

    Update (23-Oct-2002): Another resource that might be helpful is the perltie documentation in the standard distribution.

Re: Memory Restrictions
by seattlejohn (Deacon) on Oct 24, 2002 at 07:23 UTC
    Others have mentioned MD5, but I also have to ask: Are you sure the "out of memory" error doesn't indicate a problem somewhere in your code? 500 hash keys by 1000 chars each seems unlikely to cause problems unless the values associated with each key are truly huge.

    To test, I just wrote a program that generates 500 hash keys, each key a string of 1000 random characters, and each value a string of 1 million random characters, and it seems to work fine.
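
    Something along these lines (the details are reconstructed; the million-character values here just repeat a random 1000-character chunk so they generate quickly, while taking up the same amount of memory):

    use strict;
    use warnings;

    # Build a string of $len pseudo-random uppercase letters.
    sub rand_str {
        my $len = shift;
        return join '', map { chr( 65 + int rand 26 ) } 1 .. $len;
    }

    my %h;
    for (1 .. 500) {
        my $key   = rand_str(1000);           # 1000-character key
        my $value = rand_str(1000) x 1000;    # ~1,000,000-character value
        $h{$key} = $value;
    }

    printf "hash built with %d keys\n", scalar keys %h;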

    If you are still having problems after reading the suggestions on this thread, you might post some of your code to see if anyone can provide more specific help.

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Memory Restrictions
by fokat (Deacon) on Oct 24, 2002 at 03:29 UTC
    In addition to the other monks' advice, please take into consideration that hashes are not designed to be "conservative" with regard to memory consumption. They are designed to be fast. Perhaps the exception to this is a relatively new optimization where, IIRC, the text of shared hash keys is stored only once.

    A simple algorithm would be something along the lines of this untested example:

    use Digest::MD5 qw(md5_hex);

    my @uniques = ();
    my $md5;
    while (my $string = <FILE>) {
        $md5 = md5_hex($string);
        if (grep { $md5 eq $_ } @uniques) {
            warn "$string is not unique\n";
            # or push() into another list...
        } else {
            push @uniques, $md5;
        }
    }
    # Now @uniques holds the MD5 digests of the unique strings
    This should use less memory than what I imagine your solution to be. Note that showing us some of your code can help us give better answers.

    Update: Ok, ok. I added the obligatory MD5 :).

    Regards.

      please take into consideration that hashes are not designed to be "conservative" with regard to memory consumption

      quite true (and ++). perl will "over-allocate" memory on the assumption that you're always going to need more. If you know ahead of time (or can calculate at run time) how many entries you'll need, you can prevent the over-allocation by preallocating memory (check out perldata):

      my @array;
      $#array = 512;        # preallocate room for 512 elements

      # or

      my %hash;
      keys(%hash) = 512;    # preallocate buckets for 512 keys

      but I don't think this is an issue with the original post.

      -derby