alittlebitdifferent has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, The following code "works", but I don't think it should. Can anyone explain why the code seems to work and does not give me random memory garbage? My program is eating up a lot of memory using this code and I am trying to find a way to store this data on disk, but using Tie or other methods doesn't seem to take the referenced array data into account... only the array references (being scalars and all that). Anyway, here is some summarised code with the few specific lines detailed in full:
my %bigHash;
my @Arraybits;
do this a few times {
    @Arraybits = ["data","that","is","unique","for","each","loop"];
    my @tmparray = @Arraybits;
    until (We have done this a few times) {
        push @{ $bigHash{$Key} }, \@tmparray;
        # OR
        push @{ $bigHash{$Key} }, [@tmparray];
    }
    @tmpxmlarray = [];
    undef(@tmpxmlarray);
}
This code eats up memory really quickly. I put about 80,000 arrays into the hash this way. When I debug it I can see the full data set in the hash and work on it whilst the program is running. However, if I tie the hash to a DB_File or other db... I seem only able to store the array references on disk... and my RAM still fills up, storing what I assume is all the array content. I would have thought that if I use undef, the data would be marked as free in memory again... but all the array content is still available... I just can't seem to get it out of RAM and onto disk. Any suggestions on an easy way to move this hash of arrays to disk, including the array data, and avoid eating up RAM?

Re: Hash of Arrays - Pushing Array references and memory impacts question
by Marshall (Canon) on Oct 31, 2011 at 16:40 UTC
    Another possibility in a different direction is to consider using SQLite. All you have to do is install DBD::SQLite and then just "use DBI;". DBI will figure out what to do from the connect statement.

    SQLite avoids all the account setup and admin headaches of a traditional SQL server - it stores the data as just a single regular file and there are no "accounts". I've found the performance to be very good and this solution scales easily. One nice feature is that the amount of cache that it uses can be varied dynamically. I run it way up to speed up indexing operations and then run it back down for normal operation.

    I don't know enough about your application to say for sure whether this is a good idea for you. But this has become my "go to" solution for a disk-resident DB. It supports a big subset of SQL, but you can use it in a simple way without having to become an SQL guru. Maybe you just have a single un-normalized table and index one column as the "key".

    Update: this idea would be appropriate if it helped somehow in the processing of this huge hash, e.g. if you had to search for stuff that would be part of the "values" of the keys. If the job is just a matter of retrieving the set of data associated with a single key, I would think that BrowserUk's idea of making the "data" a single string instead of a reference to an array of strings would make a lot of sense. This also reduces memory requirements somewhat, as a single string takes less memory than an array of strings.
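    For illustration, here is a minimal sketch of that single-table approach. The file name (hoa.db), table name (hoa) and column names (hashkey, data) are all hypothetical, not from the thread:

        use strict;
        use warnings;
        use DBI;

        my $Key      = 'some_key';
        my @tmparray = qw( data that is unique );

        # A single regular file; no server, no accounts.
        my $dbh = DBI->connect( "dbi:SQLite:dbname=hoa.db", "", "",
            { RaiseError => 1, AutoCommit => 0 } );

        # One un-normalized table: many rows may share the same key.
        $dbh->do( "CREATE TABLE IF NOT EXISTS hoa ( hashkey TEXT, data TEXT )" );

        # Run the cache way up for bulk loading and indexing ...
        $dbh->do( "PRAGMA cache_size = 100000" );

        my $ins = $dbh->prepare( "INSERT INTO hoa ( hashkey, data ) VALUES ( ?, ? )" );
        $ins->execute( $Key, join( $;, @tmparray ) );   # one array flattened per row
        $dbh->commit;

        # ... index the key column once loaded, then run the cache back down.
        $dbh->do( "CREATE INDEX IF NOT EXISTS hoa_idx ON hoa ( hashkey )" );
        $dbh->do( "PRAGMA cache_size = 2000" );

        # Retrieve every array stored under one key.
        my $rows = $dbh->selectcol_arrayref(
            "SELECT data FROM hoa WHERE hashkey = ?", undef, $Key );
        my @arrays = map { [ split /\Q$;\E/, $_ ] } @$rows;

    The cache_size pragma is what lets you vary the cache dynamically, as described above.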

Re: Hash of Arrays - Pushing Array references and memory impacts question
by moritz (Cardinal) on Oct 31, 2011 at 15:15 UTC
    @Arraybits = ["data","that","is","unique", "for","each","loop"];

    That places an array ref into the first element of an array -- is that really what you want?

    Anyway, for nested data structures you could use DBM::Deep for on-disc storage.
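    A minimal sketch of what that could look like, assuming a made-up file name:

        use strict;
        use warnings;
        use DBM::Deep;

        # The whole nested structure lives in this one file, not in RAM.
        my $db = DBM::Deep->new( "bighash.db" );

        # Pushing an anonymous array stores the array contents on disk too,
        # not just a reference.
        push @{ $db->{some_key} }, [ "data", "that", "is", "unique" ];

        # Reads walk the on-disk structure transparently.
        for my $aref ( @{ $db->{some_key} } ) {
            print "@$aref\n";
        }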

Re: Hash of Arrays - Pushing Array references and memory impacts question
by BrowserUk (Patriarch) on Oct 31, 2011 at 15:33 UTC
    Any suggestions on an easy way to move this Hash of Arrays to disk including the array data and avoid eating up RAM?

    The simple way to work around the simpler (but usually much faster) tied DBs' aversion to references is to join your arrays into single strings for storage and split them on retrieval. With judicious use of a separator -- say $; for ASCII stuff, or maybe "\xfe\xff" for Unicode? -- this can be surprisingly efficient.
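    A sketch of that approach against DB_File; the file name and sample data are illustrative:

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        tie my %bigHash, 'DB_File', 'bighash.db', O_RDWR|O_CREAT, 0666, $DB_HASH
            or die "Cannot tie bighash.db: $!";

        # Store: flatten the array into one scalar, using $; as the separator.
        my @tmparray = qw( data that is unique );
        $bigHash{some_key} = join $;, @tmparray;

        # Retrieve: split the scalar back into the original list.
        my @restored = split /\Q$;\E/, $bigHash{some_key};

    Several arrays per key can still be kept by appending them with a second, distinct separator and splitting in two passes.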


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Hash of Arrays - Pushing Array references and memory impacts question
by CountZero (Bishop) on Oct 31, 2011 at 21:34 UTC
    Rather than posting some non-functional pseudo-code, you should give us a working program that exhibits the (un)wanted effects. Especially with problems of speed and memory consumption, it is essential to see the actual code; otherwise you are unlikely to get useful answers.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      A sincere thank you to everyone for their assistance and ideas.

      To moritz: Of course!! *Slaps head* I saw that reference in Dumper and thought it was weird - Fixed! Thank you.

      To BrowserUk: Very smart. This was a great idea which I hadn't tried. I started to write it when I made the decision to drop hashes. I think it would have worked really well, but I decided that, given the amount of searches I am doing on this data set, the SQLite idea would set me up for some further expansion. However, joining the data would certainly have allowed it to live 'inside' the hash and made it storable, which would be fantastic.

      To ~~David~~: Excellent point. I had switched strict off temporarily and had forgotten to enable it again. Saw that ARRAY issue immediately. Thank you.

      To Marshall: I am going to try using SQLite. As I use Cygwin on Windows XP, I was concerned that CPAN wouldn't install it correctly for me... however it did, and so I think it might be a great solution. It's a little disappointing to drop the hash, as I had come up with a number of extremely creative referencing mechanisms and subroutine calls to the arrays... but the amount of data I am dealing with is immense (1.5GB of text per run) and I do agree that having access to SQL queries will benefit me in the long run. Thank you for convincing me I should just bite the bullet.

      To CountZero: I was too embarrassed about other bits of my code to have you guys look at it, but I appreciate your point and will provide a working program in future requests. Although I will say that thankfully, this time around, the answers here were very useful.

Re: Hash of Arrays - Pushing Array references and memory impacts question
by ~~David~~ (Hermit) on Oct 31, 2011 at 16:03 UTC
    You need to use strict and warnings...
    You are undef'ing an array that doesn't exist...
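    For instance, a two-line illustration of how strict flags it (variable names taken from the original post):

        use strict;
        use warnings;

        my @tmparray = ( "data", "bits" );
        @tmpxmlarray = ();   # dies at compile time: Global symbol "@tmpxmlarray"
                             # requires explicit package name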
Re: Hash of Arrays - Pushing Array references and memory impacts question
by sundialsvc4 (Abbot) on Nov 01, 2011 at 13:47 UTC

    There is a bit of a catch-22 situation here. The real problem with “memory” is that in the end it is virtual memory, with randomly-occurring page faults depending on just how much “locality of reference” your program displays (as well as the total amount of the memory resource you are consuming vs. the amount available in the system). Even when you “tie” an array/hash, the access pattern is still random, so there is still a lot of in-memory buffering going on, thus still a lot of paging activity and a large working-set size.

    I think that you will just have to re-design the thing... take a fundamentally different approach. You might need to work with “80,000 data sets,” but surely they don't have to be “in memory” simultaneously. If you are comparing two data sets, surely there is an opportunity for a (SQLite...?) JOIN, and so forth. Avoid doing “sorting and searching” in-memory, and avoid writing procedural code insofar as possible.
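    A hedged sketch of that direction, with made-up table and column names (set_a, set_b, hashkey, data), letting SQLite compare the two sets on disk rather than Perl in memory:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( "dbi:SQLite:dbname=hoa.db", "", "",
            { RaiseError => 1 } );

        # The JOIN happens inside SQLite, so neither data set needs to be
        # held in a Perl hash.
        my $sth = $dbh->prepare( q{
            SELECT a.hashkey, a.data, b.data
            FROM   set_a AS a
            JOIN   set_b AS b ON a.hashkey = b.hashkey
            WHERE  a.data <> b.data
        } );
        $sth->execute;
        while ( my ( $key, $left, $right ) = $sth->fetchrow_array ) {
            print "$key differs: $left vs $right\n";
        }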