in reply to Re: Handling HUGE amounts of data
in thread Handling HUGE amounts of data

This produces a file of 8400 lines of 17,000 random numbers each (~1GB) in a little under 7 minutes.

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];

for ( 1 .. 8400 ) {
    print join ',', map 1e5 + int( rand 9e5 ), 1 .. 17000;
}
__END__

[11:04:48.11] C:\test>885103 > junk.dat

[11:12:10.44] C:\test>dir junk.dat
30/01/2011  11:12       999,608,400 junk.dat

I appreciate that your application is doing something more complicated in terms of the numbers produced, but my point is that creating a file this size isn't hard.

So the question becomes, what problems are you having? What is it that your current code isn't doing? Basically, what is it that you are asking for help with? Because so far, that is completely unstated.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: Handling HUGE amounts of data
by Dandello (Monk) on Jan 30, 2011 at 17:32 UTC

    Here's the processing file as it stands today:

    What it's doing is giving me an 'out of memory' when processing through data that should generate a 2-dimensional 'array' 17120 elements wide and 8400 lines long.

    If I cut the number of lines down to 1200, it gets to 'write_to_output', begins printing to the file, then gives me an 'out of memory' at about line 750. It also may or may not return to the C prompt.

    If I cut the lines down to 800, it processes everything and brings up 'table1' as it should.

    However, even when it's finished writing to '$datafileout', there's a delay of several seconds after closing the 'table1' notice before the C prompt comes back.

    I'm assuming that means that some process hasn't been closed out properly, but for the life of me, I can't see what it is. All the file handles are closed and it doesn't throw any warnings.

    This is a Lenovo desktop with XP Pro and 4 Gigs of RAM.

      'out of memory' when processing through data that should generate a 2-dimensional 'array' 17120 elements wide and 8400 lines long.... XP Pro and 4 Gigs of RAM

      A 17120x8400 array of integers is roughly 144 million Perl scalars; at a few dozen bytes per scalar, that requires around 4.5GB of RAM.

      If your XP is running 32-bit, the most ram available to perl is 2GB and you will run out of memory.

      If you are running 64-bit windows and a 64-bit perl, then you will move into swapping, and processing will get horribly slow.

      Is it necessary for your algorithm that you build the entire data structure in memory?

      Or could you produce your data one row at a time and then write it to disk before starting on the next? The example I posted above does this and only requires 9MB of ram.
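
      For illustration only, here's a stripped-down sketch of that row-at-a-time pattern; compute_cell() is a made-up stand-in for whatever math your code actually performs per value:

      #! perl -slw
      use strict;

      # Hypothetical stand-in for whatever produces one value of your data.
      sub compute_cell { my( $row, $col ) = @_; return 1e5 + ( $row * $col ) % 9e5 }

      open my $out, '>', 'junk.dat' or die "open: $!";

      for my $row ( 1 .. 8400 ) {
          # Build one row of 17120 values, write it, and let Perl free it
          # before the next iteration; memory use stays at a few MB throughout.
          print { $out } join ',', map compute_cell( $row, $_ ), 1 .. 17120;
      }

      close $out or die "close: $!";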

      If it is absolutely necessary to build the whole thing in memory before outputting it, then I would pack the integers in memory. An array of 8400 strings of ~67KB each (17120 × 4 bytes) comes to only about 1/2GB. Whilst unpacking/repacking the values to perform the required math would slow things down a little, it will still be much faster than the alternatives.
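
      As a rough, untested sketch of that packing idea (not your existing code; the dimensions are just the ones you quoted), each row becomes one string of 32-bit unsigned integers and individual cells are read/written with pack/unpack and substr:

      #! perl -slw
      use strict;

      my $COLS = 17120;

      # One packed string per row: 17120 x 4 bytes = 68,480 bytes, so 8400
      # rows come to roughly 1/2GB instead of several GB of individual scalars.
      my @rows = map { "\0" x ( $COLS * 4 ) } 1 .. 8400;

      # Read the cell at row $r, column $c.
      sub get_cell {
          my( $r, $c ) = @_;
          return unpack 'N', substr( $rows[ $r ], $c * 4, 4 );
      }

      # Overwrite the cell at row $r, column $c.
      sub set_cell {
          my( $r, $c, $value ) = @_;
          substr( $rows[ $r ], $c * 4, 4 ) = pack 'N', $value;
      }

      set_cell( 0, 123, 100456 );
      print get_cell( 0, 123 );    # prints 100456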


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        If it is absolutely necessary to build the whole thing in memory before outputting it, then I would pack the integers in memory.
        vec is a usable alternative to pack/unpack in terms of functionality, and might be a little more user-friendly in terms of API. Or not, depending on your personal preference.
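
        For instance (my own quick illustration, not tested against the OP's data), the same per-row idea with vec and 32-bit elements looks like this:

        #! perl -slw
        use strict;

        my $COLS = 17120;

        # One pre-sized string per row; vec() then treats each string as an
        # array of 32-bit unsigned integers.
        my @rows = ( "\0" x ( $COLS * 4 ) ) x 8400;

        # Set and get read almost like ordinary array indexing.
        vec( $rows[ 0 ], 123, 32 ) = 100456;    # row 0, column 123
        print vec( $rows[ 0 ], 123, 32 );       # prints 100456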