in reply to Handling HUGE amounts of data

Thank you all for at least looking

This is a population-estimate table with a maximum population (x axis) of 17000, where each entry has a random 6-digit number (among other things) assigned to it, over 8400 years.

I've already done some rearranging of subs (like generating the randoms one row at a time.)

I was thinking some DB management might be helpful simply because I know those can get huge.

What I will probably do is break the output into interlocking chunks so that each chunk comes in at 20-40 MB instead of one output file at 400 MB.
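As a minimal sketch of that chunking idea (the sizes and file names here are made up for illustration and scaled down; the real table would be 17000 wide, 8400 rows, with something like 800 rows per ~20-40 MB file):

```perl
#!/usr/bin/perl
# Sketch: split the big output into fixed-size chunk files, writing each
# row straight to disk so nothing large accumulates in memory.
use strict;
use warnings;

my $width          = 100;   # columns per row  (real run: 17_000)
my $height         = 100;   # total rows       (real run: 8400)
my $rows_per_chunk = 40;    # rows per file    (real run: ~800 => 20-40MB files)

my ( $chunk, $fh ) = ( 0, undef );

for my $row ( 1 .. $height ) {
    if ( ( $row - 1 ) % $rows_per_chunk == 0 ) {    # start a new chunk file
        close $fh if $fh;
        open $fh, '>', sprintf( 'chunk%03d.csv', ++$chunk )
            or die "open: $!";
    }
    # one row of 6-digit randoms, written immediately
    print {$fh} join( ',', map { 100_000 + int rand 900_000 } 1 .. $width ), "\n";
}
close $fh if $fh;
print "wrote $chunk chunk files\n";
```

With these toy numbers it produces three files; only one row ever lives in memory at a time.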

When this project started, I was told it would be 2000 wide and 2000 tall - no problem. Then today I got the actual data - 17000 wide and 8400 tall.

Luckily this is NOT a web app - I spent a week learning Perl/Tk so it could run from a C prompt.

Re^2: Handling HUGE amounts of data
by BrowserUk (Patriarch) on Jan 30, 2011 at 11:21 UTC

    This produces a file of 8400 lines of 17000 random numbers ~1GB in a little under 7 minutes.

    #! perl -slw
    use strict;
    use Math::Random::MT qw[ rand ];

    for ( 1 .. 8400 ) {
        print join ',', map 1e5 + int( rand 9e5 ), 1 .. 17000;
    }
    __END__
    [11:04:48.11] C:\test>885103 > junk.dat
    [11:12:10.44] C:\test>dir junk.dat
    30/01/2011  11:12       999,608,400 junk.dat

    I appreciate that your application is doing something more complicated in terms of the numbers produced, but my point is that creating a file this size isn't hard.

    So the question becomes, what problems are you having? What is it that your current code isn't doing? Basically, what is it that you are asking for help with? Because so far, that is completely unstated.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Here's the processing file as it stands today:

What it's doing is giving me an 'out of memory' when processing through data that should generate a two-dimensional 'array' 17120 elements wide and 8400 lines long.

If I cut the number of lines down to 1200, it gets to 'write_to_output' and begins to print to the file, then gives me an 'out of memory' at about line 750. It also may or may not return to the C prompt.

      If I cut the lines down to 800, it processes everything and brings up 'table1' as it should.

However, even after it's finished writing to '$datafileout', there's a delay of several seconds between closing the 'table1' notice and the return of the C prompt.

      I'm assuming that means that some process hasn't been closed out properly, but for the life of me, I can't see what it is. All the file handles are closed and it doesn't throw any warnings.

      This is a Lenovo desktop with XP Pro and 4 Gigs of RAM.

'out of memory' when processing through data that should generate a 2 dimensional 'array' of 17120 elements wide and 8400 lines long.... XP Pro and 4 Gigs of RAM

A 17120x8400 array of integers is ~144 million Perl scalars; at roughly 32 bytes apiece of scalar overhead, that requires about 4.5GB of RAM.

If your XP is running 32-bit, the most RAM available to perl is 2GB and you will run out of memory.

        If you are running 64-bit windows and a 64-bit perl, then you will move into swapping, and processing will get horribly slow.

        Is it necessary for your algorithm that you build the entire data structure in memory?

        Or could you produce your data one row at a time and then write it to disk before starting on the next? The example I posted above does this and only requires 9MB of ram.

If it is absolutely necessary to build the whole thing in memory before outputting it, then I would pack the integers in memory. An array of 8400 strings of 68,480 bytes each (17120 * 4 bytes) comes to only ~1/2GB. Whilst unpacking/repacking the values to perform the required math would slow things down a little, it will still be much faster than the alternatives.
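A minimal sketch of that packed-storage idea (the single-cell read/update helper lines are illustrative, not from any posted code):

```perl
#!/usr/bin/perl
# Sketch: store one row of the table as a single string of packed
# 32-bit integers instead of 17_120 individual Perl scalars.
use strict;
use warnings;

my $width = 17_120;

# Build one packed row: 17_120 unsigned 32-bit ints in one string.
my $row = pack 'V*', map { 100_000 + int rand 900_000 } 1 .. $width;
printf "packed row is %d bytes\n", length $row;    # 17_120 * 4 = 68_480

# Read or modify a single cell without unpacking the whole row:
my $i     = 42;                                    # column index
my $value = unpack 'V', substr $row, $i * 4, 4;    # fetch cell $i
substr $row, $i * 4, 4, pack 'V', $value + 1;      # store it back, incremented

# Unpack the whole row only when the math (or output) needs it:
my @cells = unpack 'V*', $row;
```

With 8400 such rows the whole table fits in roughly half a gigabyte, at the cost of a pack/unpack on each access.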
