artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I would like to avoid an 'out of memory' error in my program, which is caused by the large amount of data held in its data structures. I have to compare texts for equality, and I am storing them as hash keys. The keys are texts of approximately 1000 characters each, I have about 500 such entries, and I have to find the duplicate entries.

Which technique can help me avoid the 'out of memory' error?

Thanks,
Artist

Replies are listed 'Best First'.
Re: Memory Restrictions
by graff (Chancellor) on Oct 24, 2002 at 03:05 UTC
    Depending on the relative "confusability" of the texts, you probably don't need to load the entire 1000 or so characters per text in order to establish uniqueness.

    Try working out an algorithm where you start by comparing, say, the first 80 characters of each text, set aside the ones whose prefixes are unique, and do pair-wise comparisons only on the texts whose prefixes collide.
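
    A rough sketch of that two-pass idea (it assumes the texts live in files named on the command line, and the 80-character cutoff is arbitrary):

    use strict;
    use warnings;

    # First pass: keep only an 80-character prefix per text, so the full
    # texts never have to sit in memory at the same time.
    my %by_prefix;
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        read $fh, my $prefix, 80;
        close $fh;
        push @{ $by_prefix{$prefix} }, $file;
    }

    # Second pass: only texts whose prefixes collide need a full comparison.
    for my $group (grep { @$_ > 1 } values %by_prefix) {
        my %full;
        for my $file (@$group) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            my $text = do { local $/; <$fh> };   # slurp the whole text
            close $fh;
            push @{ $full{$text} }, $file;
        }
        print "identical texts: @$_\n" for grep { @$_ > 1 } values %full;
    }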

    Another possibility, if "equality" is supposed to mean "byte level identity" among the texts: read each one, one at a time, and generate an MD5 signature or checksum for it. Now just compare the checksum signatures -- in fact, just sort them and look for consecutive values that are identical; those will be identical texts.

    update: If the texts vary in size, even a little bit, just comparing sizes might be a good thing to start with. If you want to ignore differences in white-space characters (like "diff -b"), read the texts one at a time and normalize white-space before noting their lengths (and generating checksums).
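
    A minimal sketch of the checksum idea, again assuming one text per file; the white-space normalization is optional and mimics "diff -b" (Digest::MD5 is on CPAN and ships with Perl 5.8):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;    # digest => first file seen with that content
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $text = do { local $/; <$fh> };   # slurp the whole text
        close $fh;

        $text =~ s/\s+/ /g;                  # normalize white-space, like "diff -b"
        my $digest = md5_hex($text);

        if (exists $seen{$digest}) {
            print "$file duplicates $seen{$digest}\n";
        } else {
            $seen{$digest} = $file;
        }
    }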

      I agree, MD5 is totally the way to go for this. Simple and efficient.
Re: Memory Restrictions
by atcroft (Abbot) on Oct 24, 2002 at 02:28 UTC

    You may wish to consider a tied data structure. Two tutorials are available on the site, The Tie()s That Bind and Tie: Creating Special Objects, that might be helpful. My limited understanding is that you can tie() a data structure such as an array or a hash to a file, so the data is maintained in the file rather than in memory. A little slower, perhaps, but it would ease your memory requirements (although if you are low on disk space as well, you might encounter problems).
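
    For what it's worth, a minimal sketch of that idea, assuming DB_File (with Berkeley DB) is installed and that the texts arrive one per line on standard input -- both of those are just assumptions for the example:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Tie %seen to a file on disk so the hash data lives there, not in RAM.
    my %seen;
    tie %seen, 'DB_File', 'seen_texts.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie seen_texts.db: $!";

    while (my $text = <STDIN>) {
        chomp $text;
        if (exists $seen{$text}) {
            print "duplicate: $text\n";
        } else {
            $seen{$text} = 1;
        }
    }

    untie %seen;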

    Hope that helps, and I look forward to the suggestions of others that might also be helpful.

    Update (23-Oct-2002): Another resource that might be helpful is the perltie documentation in the standard distribution.

Re: Memory Restrictions
by seattlejohn (Deacon) on Oct 24, 2002 at 07:23 UTC
    Others have mentioned MD5, but I also have to ask: Are you sure the "out of memory" error doesn't indicate a problem somewhere in your code? 500 hash keys by 1000 chars each seems unlikely to cause problems unless the values associated with each key are truly huge.

    To test, I just wrote a program that generates 500 hash keys, each key a string of 1000 random characters, and each value a string of 1 million random characters, and it seems to work fine.
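
    Something along these lines (the details are reconstructed; the million-character values here just repeat a random 1000-character chunk so they generate quickly, while taking up the same amount of memory):

    use strict;
    use warnings;

    # Build a string of $len pseudo-random uppercase letters.
    sub rand_str {
        my $len = shift;
        return join '', map { chr( 65 + int rand 26 ) } 1 .. $len;
    }

    my %h;
    for (1 .. 500) {
        my $key   = rand_str(1000);           # 1000-character key
        my $value = rand_str(1000) x 1000;    # ~1,000,000-character value
        $h{$key} = $value;
    }

    printf "hash built with %d keys\n", scalar keys %h;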

    If you are still having problems after reading the suggestions on this thread, you might post some of your code to see if anyone can provide more specific help.

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Memory Restrictions
by fokat (Deacon) on Oct 24, 2002 at 03:29 UTC
    In addition to the other monks' advice, please take into consideration that hashes are not designed to be "conservative" with regard to memory consumption. They are designed to be fast. Perhaps the exception to this is a relatively new optimization where, IIRC, the text of shared hash keys is stored only once.

    A simple algorithm would be something along the lines of this untested example:

    use Digest::MD5 qw(md5_hex);

    my @uniques = ();
    my $md5;
    while (my $string = <FILE>) {
        $md5 = md5_hex($string);
        if (grep { $md5 eq $_ } @uniques) {
            warn "$string is not unique\n";
            # or push() into another list...
        } else {
            push @uniques, $md5;
        }
    }
    # Now @uniques holds the MD5 digests of the unique strings
    This should use less memory than what I imagine your solution to be. Note that showing us some of your code can help us give better answers.

    Update: Ok, ok. I added the obligatory MD5 :).

    Regards.

      please take into consideration that hashes are not designed to be "conservative" with regard to memory consumption

      quite true (and ++). perl will "over-allocate" memory on the assumption that you're always going to need more. If you know ahead of time (or can calculate at run time) how many entries you'll need, you can prevent the over-allocation by preallocating memory (check out perldata):

      my @array;
      $#array = 512;        # preallocate room for 512 elements

      # or

      my %hash;
      keys(%hash) = 512;    # preallocate buckets for 512 keys

      but I don't think this is an issue with the original post.

      -derby