turquoise_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have written a script for sorting a CSV file using Parse::CSV. Much of the info I got was off here, so thanks :) It works fine on reasonably sized CSV files, but I have one that has 50 fields and approx 1,400,000 rows, and I get the following memory-related error: "out of memory during request for 4096 bytes, total sbrk() is 401864704 bytes". I was wondering if there was anything I could do about this, basically? Can I dump info sporadically so the memory usage doesn't build up? The code is:
#!/bin/perl -w
use Data::Dumper;
use Parse::CSV;

my @csvdata;

# open the file and parse
my $file = Parse::CSV->new(
    file => 'darkint.csv',
);
open(my $outfile, ">", 'sorted.csv') or die $!;
while (my $val = $file->fetch) {
    push(@csvdata, $val);
}
if ($file->errstr) {
    print $file->errstr, " error in parsing\n";
}

# here we define a subroutine to sort our data in the correct manner
sub tableSorter($$) {
    my ($row1, $row2) = @_;
    my $column10comparison = $row1->[9] cmp $row2->[9];
    if ($column10comparison != 0) {
        return $column10comparison;
    }
    my $column33comparison = $row1->[32] cmp $row2->[32];
    if ($column33comparison != 0) {
        return $column33comparison;
    }
    return $row1->[33] <=> $row2->[33];
}

# now sort
my @sorted_dat = sort tableSorter @csvdata;

# arrange the sorted array into csv format
my $temp = join "\n", map { $_ = join ", ", @{$_} } @sorted_dat;
print $outfile $temp;

# close the output file
close($outfile);

Replies are listed 'Best First'.
Re: memory problems parsing large csv file
by BrowserUk (Patriarch) on Aug 24, 2009 at 13:58 UTC

    You are accumulating 5 complete copies of your data in memory: @csvdata; @sorted_dat; the list you feed into map here: map{ $_=join ", ", @{$_}} @sorted_dat; the list you feed into join here: join "\n", map{ ... }; and the entire thing as a single huge string here: my $temp = join "\n", ...;

    By a slight rearrangement of your code you can avoid 3 of these:

    ...

    # now sort
    my @sorted_dat = sort tableSorter @csvdata;

    @csvdata = ();   ## Discard the unsorted data

    ## csvify and output the sorted data one line at a time
    ## so avoiding two more in-memory copies

    # arrange the sorted array into csv format
    print $outfile join( ", ", @{$_} ), "\n" for @sorted_dat;

    # close the output file
    close($outfile);

    In theory, there is an optimisation in newer versions of Perl that will sort an array in-place if the output and input arrays are the same:

    @csvdata = sort tableSorter @csvdata;

    But it doesn't always seem to kick in?

    Beyond that, you might need to resort to using an external sort utility (e.g. your system's sort utility), though you might need to pre- and/or post-process your files to allow it to produce the sort order you need.
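    A rough sketch of what that pre/post-processing might look like, driving the system's sort from Perl. The column indices and file name come from the original post; the temporary file names are made up for illustration, and the whole thing assumes the key fields never contain embedded commas or tabs (real CSV quoting would need Text::CSV instead of split):

    use strict;
    use warnings;

    # Pre-process: prefix each line with a tab-separated sort key
    open my $in,    '<', 'darkint.csv' or die $!;
    open my $keyed, '>', 'keyed.tmp'   or die $!;
    while (my $line = <$in>) {
        my @f = split /,/, $line, -1;
        print {$keyed} join("\t", @f[9, 32, 33]), "\t", $line;
    }
    close $keyed;
    close $in;

    # Let the external sort do the heavy lifting (it spills to disk as needed)
    system('sort', '-t', "\t", '-k1,1', '-k2,2', '-k3,3n',
           '-o', 'keyed_sorted.tmp', 'keyed.tmp') == 0
        or die "external sort failed: $?";

    # Post-process: strip the three key fields we prepended
    open my $sorted, '<', 'keyed_sorted.tmp' or die $!;
    open my $out,    '>', 'sorted.csv'       or die $!;
    while (my $line = <$sorted>) {
        $line =~ s/^(?:[^\t]*\t){3}//;
        print {$out} $line;
    }
    close $out;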


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks for the replies guys. I suspect I will need a mixture of them all! I will try your suggestions and post back when (if!) it's sorted.
Re: memory problems parsing large csv file
by salva (Canon) on Aug 24, 2009 at 13:44 UTC
    There are several strategies you can use depending on the file size.

    For medium-sized files that almost fit in the available memory:

    1. load into memory just the sorting keys and the line offsets
    2. sort this information in memory
    3. use it to create the final file: seek from offset to offset in the source file, read each line in the sorted order and write it to the target file (a rough sketch follows below)
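    A minimal sketch of that keys-plus-offsets approach. It uses Text::CSV (rather than the OP's Parse::CSV) because we need the raw byte offset from tell before each record is read, and it assumes no field contains an embedded newline, so one record is one physical line:

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1 });
    open my $in, '<', 'darkint.csv' or die $!;

    my @index;    # [ column 10, column 33, column 34, byte offset of the line ]
    while (1) {
        my $offset = tell $in;
        my $row    = $csv->getline($in) or last;
        push @index, [ $row->[9], $row->[32], $row->[33], $offset ];
    }

    # Sort only the small index, not the full rows
    my @sorted = sort {
           $a->[0] cmp $b->[0]
        || $a->[1] cmp $b->[1]
        || $a->[2] <=> $b->[2]
    } @index;

    # Re-read the lines in sorted order and copy them to the target file
    open my $out, '>', 'sorted.csv' or die $!;
    for my $rec (@sorted) {
        seek $in, $rec->[3], 0 or die $!;
        print {$out} scalar <$in>;
    }
    close $out;
    close $in;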

    For really large files use Sort::External.

Re: memory problems parsing large csv file
by Tanktalus (Canon) on Aug 24, 2009 at 19:33 UTC

    It's a convoluted algorithm, that's for sure. And what else is sure is that someone has probably solved it already. There are, as mentioned, modules that can sort using the disk (Sort::External), though its documentation says it keeps everything in scalars, and you really want your rows parsed - no problem, you can reparse them. Er, that seems like a waste ;-)
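    A rough sketch of the Sort::External route, assuming its feed/finish/fetch interface. To work around the "scalars only" point above without reparsing in a sort callback, it serialises the sort key into a prefix on each raw line, which assumes column 34 holds non-negative integers of at most 20 digits and that no field contains a NUL byte or an embedded newline:

    use strict;
    use warnings;
    use Sort::External;

    my $sortex = Sort::External->new( mem_threshold => 2**24 );   # ~16 MB in RAM

    open my $in, '<', 'darkint.csv' or die $!;
    while (my $line = <$in>) {
        my @f = split /,/, $line, -1;
        # Build a string key that sorts lexically in the order we want:
        # column 10, then column 33, then zero-padded column 34.
        my $key = join "\0", $f[9], $f[32], sprintf("%020d", $f[33]);
        $sortex->feed("$key\0$line");
    }
    close $in;
    $sortex->finish;

    open my $out, '>', 'sorted.csv' or die $!;
    while (defined(my $rec = $sortex->fetch)) {
        my $line = (split /\0/, $rec, 4)[3];   # drop the key prefix
        print {$out} $line;
    }
    close $out;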

    Personally, my preferred manner of accessing CSV files is via DBD::CSV. In this case, I'd just use "SELECT * FROM darkint ORDER BY COL9, COL32, COL33" (with some other minor setup required - making "darkint" point to "darkint.csv", setting up the column names if they aren't in the file's first line, and possibly setting the EOL character). I think those guys have solved this problem. If not, I'd then use it as an excuse to populate a real DBMS, and, with little change, be able to continue using DBI to get my data ;-)
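    A rough sketch of that setup. It assumes the first line of darkint.csv carries the column names and that COL9, COL32 and COL33 stand in for whatever the three sort columns are actually called; as the reply below notes, DBD::CSV will still pull the whole table into memory:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir      => ".",        # directory holding the CSV files
        RaiseError => 1,
    });

    # Make the table name "darkint" point at the actual file
    $dbh->{csv_tables}{darkint} = { file => "darkint.csv" };

    my $sth = $dbh->prepare("SELECT * FROM darkint ORDER BY COL9, COL32, COL33");
    $sth->execute;

    open my $out, '>', 'sorted.csv' or die $!;
    while (my @row = $sth->fetchrow_array) {
        print {$out} join(", ", @row), "\n";
    }
    close $out;
    $dbh->disconnect;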

      DBD::CSV will read the complete CSV into memory, so it won't solve memory hog problems.

      As a side note, in the upcoming version of DBD::CSV you don't need that (sym)link anymore:

      my $dbh = DBI->connect ("dbi:CSV:", undef, undef, { f_ext => ".csv/r", f_dir => ".", f_schema => undef });

      will open only the files with a .csv extension and strip that extension from the table name.


      Enjoy, Have FUN! H.Merijn
Re: memory problems parsing large csv file
by Anonymous Monk on Aug 24, 2009 at 13:44 UTC
Re: memory problems parsing large csv file
by NiJo (Friar) on Aug 24, 2009 at 18:05 UTC
    Basically you are doing a three-level sort where the lower levels are used to break ties. A more effective wheel was invented some decades ago:
    sort -k10,10 -k33,33 -k34,34n <other options> <infile >outfile
    Edit: -d is the field delimiter for e.g. cut, but not for sort.

      Wouldn't you also need to specify -t,?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.