in reply to Re: Handling HUGE amounts of data
in thread Handling HUGE amounts of data

Someone has suggested packing the data

You forgot who?

This doesn't attempt to perform your required processing, but just demonstrates that it is possible to have two 8400x17120 element datasets in memory concurrently, provided you use the right formats for storing them.

From what you said, @aod only ever holds a single char per element, so instead of using a whole 64-byte scalar for each element, use strings of chars for the second level of @aod and use substr to access the individual elements.
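
As a minimal sketch of that access pattern (the 3x5 grid and the values are made up for illustration), reading and writing one cell of a string-per-row @aod could look like this:

```perl
use strict;
use warnings;

# Hypothetical small grid: 3 rows of 5 one-character cells, all 'd'.
my @aod = map { 'd' x 5 } 1 .. 3;

# Read one cell: the equivalent of $aod[1][2] in an array of arrays.
my $cell = substr( $aod[1], 2, 1 );

# Write one cell: 4-argument substr replaces the character in place.
substr( $aod[1], 2, 1, 'a' );

print "$cell\n";      # d
print "$aod[1]\n";    # ddadd
```

Each row costs one scalar plus one byte per cell, rather than one scalar per cell.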

For @aob, you need only integers, so use Tie::Array::Packed for that. It uses just 4 bytes per element instead of 24, but as it is tied, you use it just as you would a normal array.
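
To illustrate why packed storage is so much smaller, here is a rough sketch using only the core vec function. This is not Tie::Array::Packed itself and makes no claim about how that module works internally; the row length and values are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical row of 10 integers kept in a single string, 4 bytes each.
my $n   = 10;
my $row = '';
vec( $row, $_, 32 ) = 100_000 + $_ for 0 .. $n - 1;

# The whole row occupies exactly 4 bytes per element.
print length( $row ), "\n";        # 40

# Element 3 is read back by index, much like $row[3] on a normal array.
print vec( $row, 3, 32 ), "\n";    # 100003
```

A tied packed array hides this kind of bookkeeping behind the normal $row[ $i ] syntax.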

Putting those two together, you can have both your arrays fully populated in memory and it uses around 1.2GB instead of 9GB as would be required with standard arrays:

#! perl -slw
use strict;
use Tie::Array::Packed;
#use Math::Random::MT qw[ rand ];

$|++;

my @aod = map { 'd' x 17120; } 1 .. 8400;

## To access individual elements of @aod
## instead of $aod[ $i ][ $j ] use:
## substr( $aod[ $i ], $j, 1 );

my @aob;
for ( 1 .. 8400 ) {
    printf "\r$_";
    tie my @row, 'Tie::Array::Packed::Integer';
    @row = map{ 1e5 + int( rand 9e5 ) } 1 .. 17120;
    push @aob, \@row;
}

## For @aob use the normal syntax $aob[ $i ][ $j ]
## but remember that you can only store integers

<>;
print $aob[ $_ ][ 10000 ] for 0 .. $#aob;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: Handling HUGE amounts of data
by Dandello (Monk) on Jan 31, 2011 at 05:51 UTC

    99.99% there - it ran out of memory when I hit the close button on the little Perl/Tk popup that comes up at the end to announce the data run was done.

    Converting @aod into a string was a big improvement, but so was finding an array that was hiding in a subroutine. Sometimes you're just too close to see things.

    Since I know the final user (my boy child) will want even more data, there's still a little more work to do.

    #model 1;
    sub popnum1 {
        ( $x, $y, $z ) = @_;
        if ( $y == 0 ) {
            $aob[$x][0] = $initial + $z;
        }
        else {
            if ( substr ($aod[ $y-1],$x,1) ne 'a' ) {
                $aob[$x][$y] = $initial + $z;
            }
            else {
                $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
            }
        }
        return $aob[$x][$y];
    }

    This is one version of the @aob generator. It's called only when the corresponding element in @aod is an 'a' (so it varies from one row to the next). $z is a freshly generated random number (a floating-point value, positive or negative); I got rid of another memory-eating array in favor of a single variable.

    So @aob is the last big array to be tamed. But I'm gaining on it. ;)

      So @aob is the last big array to be tamed.

      Did you try Tie::Array::Packed?



        I'll try to decipher its vagaries when I've had some sleep. But it looks promising.

        I'm just glad the blasted program's working adequately now.

        In tab delimited, the final data file comes in at 382 Meg.

      Well, it still throws an 'out of memory' when I close the little Perl/Tk popup that announces the script has finished running.

      I assume I've done this right as BrowserUk suggested using Tie::Array::Packed to save on RAM:

      tie @aob, 'Tie::Array::Packed::DoubleNative';

      #model 1;
      sub popnum1 {
          ( $x, $y, $z ) = @_;
          if ( $y == 0 ) {
              $aob[$x][0] = $initial + $z;
              $zaza = $aob[$x][0];
          }
          else {
              if ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                  $aob[$x][$y] = $initial + $z;
                  $zaza = $aob[$x][$y];
              }
              else {
                  $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
                  $zaza = $aob[$x][$y];
              }
          }
          return $zaza;
      }

      I figure that returning a single variable ($zaza) is more efficient than returning $aob[$x][$y], but it's hard to tell.

        I figure that returning a single variable ($zaza) is more efficient than returning $aob[$x][$y]

        Returning $aob[$x][$y] is returning a single variable. Whether you dereference the arrays here:

        $zaza = $aob[$x][$y];

        Or here:

        return $aob[$x][$y];

        Makes no difference.

        However, declaring ( $x, $y, $z ) and $zaza with my would make some difference, as lexicals are more efficient than globals. Plus you could then benefit from use strict.

        But your subroutine can be refactored as:

        sub popnum1 {
            my( $x, $y, $z ) = @_;
            if ( $y == 0 ) {
                return $aob[$x][0] = $initial + $z;
            }
            else {
                if ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                    return $aob[$x][$y] = $initial + $z;
                }
                else {
                    return $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
                }
            }
        }

        which saves a temporary variable and two double dereferences.

        Personally, I think I'd code that as:

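        sub popnum1 {
            my( $x, $y, $z ) = @_;
            ## The first row ( $y == 0 ), or a preceding cell that is not 'a',
            ## starts from $initial; otherwise accumulate onto the previous value.
            return $aob[ $x ][ $y ] =
                $y == 0 || substr( $aod[ $y - 1 ], $x, 1 ) ne 'a'
                    ? $initial + $z
                    : $z + $aob[$x][ $y - 1 ];
        }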

        Though I'd want to verify that my logic transformation was correct. That should be appreciably more efficient than your original above.
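
        One way to do that verification, as a minimal sketch: run the nested-if form and a single-expression form over the same small data set and count disagreements. Everything here is made up for the test (the sub names nested and single, the 3x3 @aod, $initial = 100, and the deterministic stand-in for $z), and the single-expression condition is written as $y == 0 || ... so that the first row takes the $initial branch.

        ```perl
        use strict;
        use warnings;

        # Hypothetical small data set, for illustration only.
        my $initial = 100;
        my @aod     = ( 'aba', 'aab', 'bbb' );    # one string per row
        my ( @aob1, @aob2 );                      # one result array per version

        # Nested-if version.
        sub nested {
            my ( $x, $y, $z ) = @_;
            if ( $y == 0 ) {
                return $aob1[$x][0] = $initial + $z;
            }
            elsif ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                return $aob1[$x][$y] = $initial + $z;
            }
            return $aob1[$x][$y] = $z + $aob1[$x][ $y - 1 ];
        }

        # Single-expression version of the same rule.
        sub single {
            my ( $x, $y, $z ) = @_;
            return $aob2[$x][$y] =
                $y == 0 || substr( $aod[ $y - 1 ], $x, 1 ) ne 'a'
                    ? $initial + $z
                    : $z + $aob2[$x][ $y - 1 ];
        }

        # Walk every cell in order and compare the two results.
        my $mismatches = 0;
        for my $y ( 0 .. 3 ) {
            for my $x ( 0 .. 2 ) {
                my $z = $x + 10 * $y;    # deterministic stand-in for the random value
                $mismatches++ if nested( $x, $y, $z ) != single( $x, $y, $z );
            }
        }
        print "mismatches: $mismatches\n";    # mismatches: 0
        ```

        The test data deliberately includes the $y == 0 row plus both 'a' and non-'a' cells, so all three branches of the nested version are exercised.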

