treebeard has asked for the wisdom of the Perl Monks concerning the following question:

I have opened a large (72 MB) file that I am parsing and loading into a hash. I then sort the hash and write it to a new text file. In the process I am receiving the following system error:

Out of memory during request for 1016 bytes, total sbrk() is 670351204 bytes!

Is this a memory leak? I have run this process over and over successfully but now all I receive is this error. Here is the code:

while (<DATAFILE1>) {
    chomp;
    ($account,$time,$dept,$dimset,$sched,$emp,$type,$val) = split(/[{]/);
    $emp = empnumber($emp);
    push(@{$employee{$dept.$account.$emp.$subctr}},
         ($account,$time,$dept,$dimset,$sched,$emp,$type,$val));
    $subctr++;
}
close(DATAFILE1);

foreach $empnum (sort keys(%employee)) {
    my ($account,$time,$dept,$dimset,$sched,$emp,$type,$val) = @{$employee{$empnum}};
    string2 = $account."{".$time."{".$dept."{".$dimset."{"."SchHeadCnt"."{".$emp."{".$type."{".$val;
    write(OUTPUT2);
}
close(OUTPUT2);

sub empnumber {
    my ($number) = @_;
    my $newnumber;
    if    ($number < 10)  {$newnumber = E00.$number}
    elsif ($number < 100) {$newnumber = E0.$number}
    else                  {$newnumber = E.$number}
    return $newnumber;
}

Replies are listed 'Best First'.
Re: Out of memory Error
by Ovid (Cardinal) on Aug 15, 2002 at 16:57 UTC

    Aside from the obvious fact that loading a 72MB file into a hash will take up more than 72MB, trying to sort that hash is also going to take up a lot of memory. Just glancing through Perlguts Illustrated suggests that an undef scalar will probably be at least 12 bytes: 4 for the pointer, 4 for the reference count, 3 for the flags and 1 (?) for the type. Note that this does not even account for the actual data being stored! All of this gives Perl tremendous flexibility, but it's not terribly memory efficient.

    I don't think this is a bug in Perl. You're just chewing up a lot of memory (and your machine's specs will also play a large part in this). I think you should check out a merge sort. The File::Sort module should handle this for you.
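    If File::Sort is a fit, a minimal sketch might look like the following; the file names are illustrative, not from the original post:

    use strict;
    use warnings;
    use File::Sort qw(sort_file);

    # Merge-sort an on-disk file into a new on-disk file instead of
    # holding and sorting everything in a Perl hash in memory.
    sort_file('unsorted_output.txt', 'sorted_output.txt');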

    Cheers,
    Ovid

    Note: Weird coincidence: I was updating this with some data from Perlguts Illustrated before I saw Abigail-II's response. Go vote for that node. Much better info than mine.

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Out of memory Error
by Abigail-II (Bishop) on Aug 15, 2002 at 17:04 UTC
    How many lines do you have in that file? For each line, you are creating 9 strings and an array. Each string will take 24 bytes of overhead (on top of the length of the string itself). The overhead of the array is about 50 bytes, plus pointers to the strings stored, plus some because arrays always get a bit of extra space allocated. That's about 120 bytes per array, making something like 336 bytes per line in the file. And that isn't counting the overhead of the hash. Or the actual content of the file.

    So, if you have about 40 characters per line on average, the 670 Mb already allocated won't do....
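    As a sanity check of these back-of-the-envelope numbers (not something from the thread itself), the CPAN module Devel::Size can report the real footprint of one such entry on your build of perl. The sample field values below are invented:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Measure the memory taken by one array of small strings and by a hash
    # holding a reference to it, roughly mirroring one line of the input.
    my @record = ('ACCT', '0800', 'DEPT', 'DIMSET', 'SCHED', 'E001', 'TYPE', '42');
    my %employee = ( 'DEPTACCTE0010' => \@record );

    printf "one record:           %d bytes\n", total_size(\@record);
    printf "hash with one record: %d bytes\n", total_size(\%employee);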

    Abigail

Re: Out of memory Error
by dws (Chancellor) on Aug 15, 2002 at 17:44 UTC
    The only "value" your script adds to the data is the reformulation of the employee number. You might consider a multi-pass approach. First, write a simple script that reformulates the employee number and builds the sort key, then write the result into a temporary table, with line looking likes this:   sortable-key original-data-with-modified-empnumber Since you're processing a line at a time, the memory footprint will be small.

    Next, use an external sort program to sort the temporary file. A good sort program can handle data that won't otherwise fit into virtual memory.

    Finally, write a small script that strips the sortable-key from the file. Clean up your temporary files, and Voila!, you're done.

    This is essentially a Schwartzian Transform using intermediate files.
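    A rough sketch of the three passes, assuming an external sort(1) is available and the data contains no tab characters; the file names, the tab delimiter, the use of $. as the uniqueness counter, and the sprintf replacement for empnumber() are illustrative choices rather than anything from the original post:

    use strict;
    use warnings;

    # Pass 1: prepend a sortable key to each line; memory stays small because
    # only one line is held at a time.
    open my $in,  '<', 'datafile1.txt' or die "open datafile1.txt: $!";
    open my $tmp, '>', 'keyed.tmp'     or die "open keyed.tmp: $!";
    while (<$in>) {
        chomp;
        my ($account, $time, $dept, $dimset, $sched, $emp, $type, $val) = split /[{]/;
        $emp = sprintf 'E%03d', $emp;                  # E1 -> E001
        my $key = join '', $dept, $account, $emp, $.;  # $. stands in for $subctr
        print {$tmp} join("\t", $key, $account, $time, $dept, $dimset,
                          'SchHeadCnt', $emp, $type, $val), "\n";
    }
    close $in;
    close $tmp;

    # Pass 2: let the external sort do the heavy lifting; it can merge-sort
    # data that will not fit in virtual memory.
    system('sort', '-o', 'keyed.sorted', 'keyed.tmp') == 0
        or die "sort failed: $?";

    # Pass 3: strip the sortable key and restore the '{' delimiters.
    open my $sorted, '<', 'keyed.sorted' or die "open keyed.sorted: $!";
    open my $out,    '>', 'output2.txt'  or die "open output2.txt: $!";
    while (<$sorted>) {
        chomp;
        my ($key, @fields) = split /\t/;
        print {$out} join('{', @fields), "\n";
    }
    close $sorted;
    close $out;

    On a system without a command-line sort, File::Sort (mentioned above) could handle the second pass instead.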

Re: Out of memory Error
by demerphq (Chancellor) on Aug 15, 2002 at 18:08 UTC
    While I'm not so sure that this is relevant, your code isn't valid.
    string2 = $account."{".$time."{".$dept."{".$dimset."{"."SchHeadCnt"."{".$emp."{".$type."{".$val;
    So if you aren't running under strict and warnings, this probably isn't doing what you think it's doing. Whether this has anything to do with your problem is a different question....

    Yves / DeMerphq
    ---
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

      You're right. I didn't declare string2, and once I did, no more errors. I should run with strict/warnings, but I would ask if anyone can point me to a good standard way of declaring all these variables. Thanks for all your help, monks. This issue threw me for a loop, as my real issue was sorting these 700k rows while fixing that one field of data (E1 -> E001).
        use the my keyword.
        use strict;
        use warnings;

        my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);

        # or

        our ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
        Note that my is lexical and declarative, like a variable declaration in C. There is also our, which is used for allowing access to a dynamic variable within a scope without raising strict/warnings. The statement is scoped but not declarative (the variable exists regardless, but strict will complain anyway).

        Also it looks like you have picked up some sloppy habits, like not quoting things that should be quoted. These will bite you under strict. But pain is good. Especially when you make it go away. ;-)
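        For the two spots being pointed out, a hedged sketch of what they might look like under strict (variable names follow the original post; sprintf here is a substitute for the original concatenation of unquoted E00/E0/E barewords):

        use strict;
        use warnings;

        # Zero-pad the employee number without bareword string constants
        # ('E00', 'E0', 'E' would need quotes; sprintf does it in one step).
        sub empnumber {
            my ($number) = @_;
            return sprintf 'E%03d', $number;   # 1 -> E001, 42 -> E042, 123 -> E123
        }

        # Declare the output line with my (and note the $ sigil on $string2).
        # The sample field values are invented for illustration.
        my ($account, $time, $dept, $dimset, $emp, $type, $val) =
            ('ACCT', '0800', 'DEPT', 'DIMSET', empnumber(7), 'TYPE', '42');
        my $string2 = join '{', $account, $time, $dept, $dimset, 'SchHeadCnt',
                                $emp, $type, $val;
        print "$string2\n";   # ACCT{0800{DEPT{DIMSET{SchHeadCnt{E007{TYPE{42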

        Yves / DeMerphq
        ---
        Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: Out of memory Error
by abitkin (Monk) on Aug 15, 2002 at 17:13 UTC
    I've had a similar problem: I was only reading a 2 MB file and was using over 200 MB of RAM, but I was doing much more with it.

    First, let me say, using hashes takes some extra space over arrays, and you should be aware of that going in.

    Second, the more modifications you make to the data, the more Perl "meta" data is required. From what I understand, and please correct me if I'm wrong, this meta data speeds up incremental modifications to your hashes and arrays.

    In short, it is not out of bounds to see a lot of memory usage with Perl, with big or small datasets, depending on how much you are doing with them.

    Perhaps there is a better way of splitting up the problem (maybe writing the data out to two or more files as you read, based on the employee, so that you have smaller datasets to work on).
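    One way that idea could look, as a sketch only: bucket the lines into one temporary file per department while reading, so each bucket can then be sorted on its own. The field order follows the original split(); the file names are illustrative, and the employee number (or any other field) would work the same way as the bucket key.

    use strict;
    use warnings;

    my %bucket_fh;   # one output handle per department seen so far
    open my $in, '<', 'datafile1.txt' or die "open datafile1.txt: $!";
    while (<$in>) {
        my ($account, $time, $dept) = split /[{]/;
        unless ($bucket_fh{$dept}) {
            open $bucket_fh{$dept}, '>', "bucket_$dept.tmp"
                or die "open bucket_$dept.tmp: $!";
        }
        print { $bucket_fh{$dept} } $_;
    }
    close $_ for $in, values %bucket_fh;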

    --
    Kwyjibo. A big, dumb, balding North American ape. With no chin.