chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm selecting the contents of a table from a remote DB, tidying up the values, and inserting the results into a local table.
According to a mysqldump (data only) it should be 350MB of data.
However, when I try to load the remote data into an AoA, it hits the 4GB limit and crashes with 'Out of memory!'.
Why??
Core code

# Collect values
while( @db_row = $sth->fetchrow_array()) {
    $data_rec = [ $db_row[0], $db_row[1], $db_row[2],
                  $db_row[3], $db_row[4], $db_row[5],
                  $db_row[6], $db_row[7], $db_row[8] ];
    # Add to list
    push(@my_arr, $data_rec);
}

Table def (sanitised):

A int(4) unsigned
B bigint(20) unsigned
C char(2)
D int(10) unsigned
E int(10) unsigned
F char(4)
G tinyint(2) unsigned
H int(10) unsigned
I tinyint(2)

Num rows: 7022790
Any ideas?

Cheers
Chris

Re: Out of Memory selecting from MySQL
by BrowserUk (Patriarch) on Oct 12, 2007 at 05:12 UTC

    Because structured data carries a substantial amount of overhead.

    use Devel::Size;;
    push @a, [ (undef) x 9 ] for 1 .. 7022790;;
    print Devel::Size::size( \@a );;
    33554472

    That shows that an AoA containing 7022790 arrays, each with 9 (undef) elements, already consumes 33 MB of RAM. Actually rather more, but when I try to use Devel::Size::total_size() on it, it blows my memory limits. That's because total_size() itself consumes a large amount of memory building a table of references to ensure that it doesn't count the same memory twice.

    In addition to that 33 MB, you have 7022790 x 9 = 63205110 scalars, each with (on average) 5 bytes of data plus (at least) 12 bytes of overhead. That's 63205110 x 17 = 1074486870 bytes ~= 1GB.
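
    For a rough feel of the per-row cost, point Devel::Size::total_size() at a single made-up row of the same shape as the table above (a sketch; the values are invented and the exact figure varies with perl version and build):

    use Devel::Size qw( total_size );

    # A hypothetical 9-column row shaped like the sanitised table def above
    my $row = [ 1, '12345678901234567890', 'AB', 1000, 2000, 'ABCD', 1, 3000, -1 ];

    print total_size( $row ), "\n";    # typically a few hundred bytes for 9 small scalars

    Multiply whatever that reports by 7022790 rows and a total in the gigabytes is no surprise.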

    But, as the array is being built, a certain amount of space is allocated to it to start with. When that space is filled, the array must be extended, so a new lump of space (I think) twice the size of the existing array is allocated, the existing data is copied over, and the old lump is freed. The new array then continues to be filled until it also reaches its limit, at which point the process is repeated.

    The problem is that the first freed lump of space is not large enough to accommodate the third-generation request (which now needs four times the original), so still more space must be requested from the OS. And when that fills, the process repeats again. And again, none of the existing freed spaces is large enough to accommodate the new request, so it is back to the OS.

    Each time, the previous iteration's space is not large enough to accommodate the next, so new space must be allocated. Whilst some of that freed space will be re-used for the small arrays and the scalars they contain, it is fairly easy to see that, as the process goes on, less and less of each previous iteration's space is needed to fulfil these smaller allocations, so you end up with large lumps of unused free space until you finally hit the 4GB limit.
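
    The over-allocation can be watched with Devel::Peek, where FILL is the highest index in use and MAX the highest index allocated (a sketch; the exact growth pattern differs between perl versions):

    use Devel::Peek;

    my @a;
    push @a, 'x' for 1 .. 5;
    Dump( \@a );    # compare FILL (last used index) against MAX (last allocated index)

    push @a, 'x' for 1 .. 95;
    Dump( \@a );    # after the intervening re-allocations MAX has moved on again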

    The solution may be fairly simple: preallocate the final size of the large array up front. Issue the same query, but with a count(*) in place of the column list, to find out how many rows you are going to get, and then use that to preallocate the array:

    my @my_arr;
    $#my_arr = 7022790;
    ...

    That should avoid the doubling and redoubling of the array allocation and so may allow you to load the entire dataset.
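
    Pulled together with the count(*) idea, that might look something like this (a sketch only; $dbh and the table name remote_table are stand-ins, and any WHERE clause must match the real query):

    # Ask the remote DB how many rows are coming
    my ( $row_count ) = $dbh->selectrow_array( 'SELECT COUNT(*) FROM remote_table' );

    # Pre-extend the array to its final size: last index is count - 1
    my @my_arr;
    $#my_arr = $row_count - 1;

    my $sth = $dbh->prepare( 'SELECT A, B, C, D, E, F, G, H, I FROM remote_table' );
    $sth->execute;

    my $i = 0;
    while ( my @db_row = $sth->fetchrow_array ) {
        $my_arr[ $i++ ] = [ @db_row ];    # assign by index so the pre-sized array is never re-grown
    }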


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Out of Memory selecting from MySQL
by Somni (Friar) on Oct 12, 2007 at 05:43 UTC
    It would appear BrowserUk addressed your memory issue. One thing I'd like to mention is that you don't need that temporary, nor do you need to be so explicit in assigning it.

    while (my @db_row = $sth->fetchrow_array) {
        push @my_arr, [ @db_row[0..8] ];
    }

      Indeed. There is also no need to duplicate the temporary array into an anonymous array. You can just take a reference to it:

      my @a; while( my @temp = split ',', <STDIN> ) { push @a, \@temp } print "@$_" for @a;;
      a,b,c
      e,f,g
      1,2,3
      ^Z
      a b c
      e f g
      1 2 3
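
      Applied to the loop in the original post, that becomes something like this (a sketch; the tidy-up step goes wherever it was before):

      while ( my @db_row = $sth->fetchrow_array ) {
          # ... tidy up @db_row in place here ...
          push @my_arr, \@db_row;    # the 'my' in the condition creates a fresh @db_row each pass
      }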

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Hm, true enough. I have a tendency to copy fetch_* return values, but that's because I use the *ref methods, and the documentation used to specifically say the references returned may be reused. Old habits die hard.
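
        For what it's worth, the DBI docs do still note that fetchrow_arrayref returns the same array reference for each fetch, so the copy is needed in that case (a sketch against the original $sth):

        while ( my $row = $sth->fetchrow_arrayref ) {
            push @my_arr, [ @$row ];    # copy the contents; storing $row itself would
                                        # leave every saved entry as the same re-used reference
        }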
Re: Out of Memory selecting from MySQL
by chrism01 (Friar) on Oct 12, 2007 at 06:09 UTC
    Somni:
    Actually, I'm cleaning up the data before adding it to the final array.
    Thx anyway

    BrowserUk
    I thought it might be overhead, but I didn't realise there was that much!
    I've gone with batching 100k rows at a time.
    Goes fast enough (for now), but I'll keep the pre-size thing in mind for future ref.

    This is on a prod box with 8GB RAM...

    Cheers
    Chris

      I've gone with batching 100k rows at a time. Goes fast enough (for now), but I'll keep the pre-size thing in mind for future ref.

      Even with 100k rows at a time, pre-sizing the array would probably show a measurable performance improvement due to the removal of the need for doubling and copying. And it won't cost you anything.
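
      One way to combine the two is to pre-size a batch array once and re-fill it by index (a sketch; process_batch() is a hypothetical stand-in for the tidy-up-and-insert step, and 100_000 is the batch size mentioned above):

      my $batch_size = 100_000;
      my @batch;
      $#batch = $batch_size - 1;    # pre-size once, re-use for every batch

      my $i = 0;
      while ( my @db_row = $sth->fetchrow_array ) {
          $batch[ $i++ ] = [ @db_row ];
          if ( $i == $batch_size ) {
              process_batch( \@batch, $i );    # hypothetical: clean up and insert the first $i rows
              $i = 0;
          }
      }
      process_batch( \@batch, $i ) if $i;      # final partial batch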


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.