in reply to Memory Efficient Alternatives to Hash of Array

Hm. Presumably you've only used <DATA> by way of example, as Perl would die just trying to load the script if it were 4GB+ in size.

Next: why are you using a HoA (hash of arrays)? On the basis of what you've posted, you have one key and one value per key, so wrapping that one value in an array just uses ~50% more memory than needed!

That is, changing:

    push @{ $hold{$elem[0]} }, $elem[1];

to:

    $hold{ $elem[0] } = $elem[1];

would contain the same information without paying for an array per key.
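
For concreteness, here is a minimal sketch of the two forms side by side. The sample rows and the %hoa / %flat names are made up for illustration; only the two assignment statements come from the post above, and the sketch assumes every tag occurs exactly once.

```perl
use strict;
use warnings;

# Hypothetical sample rows: tag, then quality string (one row per tag).
my @rows = (
    [ 'AATACGGC', 'INILILFI' ],
    [ 'TGATACGG', 'IIIQNQQN' ],
);

my ( %hoa, %flat );
for my $row (@rows) {
    my @elem = @$row;

    # Hash of arrays: each value is a reference to an anonymous array.
    push @{ $hoa{ $elem[0] } }, $elem[1];

    # Plain hash: each value is the string itself.
    $flat{ $elem[0] } = $elem[1];
}

# With unique tags both structures answer the same lookup.
print "$hoa{AATACGGC}[0]\n";    # via the array reference
print "$flat{AATACGGC}\n";      # directly
```

If exact numbers matter, a module such as Devel::Size (from CPAN, not core) could be used to measure both structures on real data rather than relying on an estimate.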

But either way, you've still got too much data to hold in memory on a 32-bit machine. And on the basis of your scripts to date, the only reason for loading it is to sort it, so you'd be far better off sorting the input file externally and then processing it line by line.
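
A rough sketch of that approach: sort the file outside Perl (for example with the system sort utility), then read the sorted file one line at a time, so that only the current tag's values are ever in memory. To stay self-contained the code reads from an in-memory filehandle; the sample data, the whitespace-separated two-column layout, the file names in the comment and the report() helper are all assumptions, not the OP's code.

```perl
use strict;
use warnings;

# Stand-in for the externally sorted file, e.g. the output of
#   sort input.txt > sorted.txt
# (file names are hypothetical). Equal tags are now adjacent.
my $sorted = join '',
    "AAAC IIII\n",
    "AAAC IIFI\n",
    "TGAT QNQQ\n";
open my $in, '<', \$sorted or die "Cannot open in-memory file: $!";

my @groups;    # collected here only so the result can be inspected

# Hypothetical per-group handler; replace with the real processing.
sub report {
    my ( $tag, @values ) = @_;
    push @groups, "$tag: @values";
}

my ( $current, @values );
while ( my $line = <$in> ) {
    chomp $line;
    my ( $tag, $value ) = split ' ', $line;
    next unless defined $value;    # skip blank or malformed lines

    # A new tag means the previous group is complete.
    if ( defined $current and $tag ne $current ) {
        report( $current, @values );
        @values = ();
    }
    $current = $tag;
    push @values, $value;
}
report( $current, @values ) if defined $current;    # flush the last group
close $in;

print "$_\n" for @groups;
```

Because the input is sorted, a change of tag is the only signal needed to close a group, so memory use is bounded by the largest single group rather than by the whole file.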


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Memory Efficient Alternatives to Hash of Array
by tilly (Archbishop) on Dec 27, 2008 at 14:56 UTC
    FYI, Perl stops processing when it sees __DATA__, so there would be no problem loading a script that is over 4 GB in size. As for the use of a hash of arrays, reading the post I would assume a badly chosen data sample rather than a misunderstanding.

    Update: Good catch, eye. The sample was well chosen.

      ...I would assume a badly chosen data sample...
      Actually, the OP's example has three sets of duplicate tags:
      Lines 6 - 9:   TGATACGGCGACCACCGAGATCTACACTCTTTCC
      Lines 15 - 17: TGCTCCGGCGACCACCGAGATCTACACTCTTTCC
      Lines 19 - 20: TTCTCCTTCGACCACCGAGATCTACACTCTTTCC
      As for the use of a hash of arrays, reading the post I would assume a badly chosen data sample rather than a misunderstanding.

      Given the OP's description of the code: "My code below, tries to group the 'error_rate' (second column of data) based on its corresponding tag (first column of data).", together with the fact that the second column appears to be a byte-wise mask for the first:

      AATACGGCCACCCCCCCCCCCCCCGCCCCTCCCC INILILFIIIIQNQQNQNLLKFKNCDHA?DAHHH

      I don't think it is just badly chosen sample data. Maybe the OP will tell us which is correct?


        From the OP's text and code I thought that the OP wanted to know all of the possible values for the second field for each possible value of the first field. Given that the first field repeats, this requires an array.
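
        A small sketch of the difference, using made-up rows in which one tag repeats (the data and variable names are illustrative, not the OP's): plain assignment keeps only the last value seen for a tag, while the push form keeps them all.

        ```perl
        use strict;
        use warnings;

        # Hypothetical rows; the first tag appears twice.
        my @rows = (
            [ 'TGATACGG', 'INIL' ],
            [ 'TGATACGG', 'QNQQ' ],
            [ 'TTCTCCTT', 'HHHA' ],
        );

        my ( %last, %all );
        for my $row (@rows) {
            my ( $tag, $value ) = @$row;
            $last{$tag} = $value;            # overwrites any earlier value
            push @{ $all{$tag} }, $value;    # appends, keeping every value
        }

        print "last: $last{TGATACGG}\n";
        print "all:  @{ $all{TGATACGG} }\n";
        ```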