newPerlr has asked for the wisdom of the Perl Monks concerning the following question:

my @filearray = <$sourcePath/*>;
foreach my $file (@filearray) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    push @list, <$fh>;
    close($fh);
}

I use this code to read from multiple files, then I sort @list and write it to a file. The problem is that the files I read from are really huge (some are 100MB), and this takes up all my memory. How can I read multiple files, sort the data, and write it to a single file?

101,20060925235712445,-1,1,00000123
115,20060925235714565,-1,5,00007893
108,20060925235712445,-1,1,00456567
102,20060925235712445,-1,6,00004563

This is a sample of the data. I must be able to sort it by a particular column.

Re: read and sort multiple files
by ikegami (Patriarch) on Dec 01, 2008 at 04:41 UTC

    Pre-sort each individual file (so you never have more than one in memory at a time), then just merge the files together. If a file is too big to be sorted, split it into smaller files first.

    This is merge sort.
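
    A minimal Perl sketch of the merge step, assuming each input file has already been sorted on its second comma-separated field (the subroutine and file names are illustrative):

```perl
use strict;
use warnings;

# Merge pre-sorted files by repeatedly emitting the smallest current line.
# Only one pending line per input file is held in memory at any time.
sub merge_sorted {
    my ($out_name, @in_names) = @_;
    open my $out, '>', $out_name or die "Cannot open $out_name: $!";

    my @fhs;
    for my $name (@in_names) {
        open my $fh, '<', $name or die "Cannot open $name: $!";
        push @fhs, $fh;
    }

    # The current front line of each file; undef once a file is exhausted.
    my @heads = map { scalar readline($_) } @fhs;

    while (grep { defined } @heads) {
        # Pick the file whose head line sorts first on field 2.
        my $min;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            $min = $i
                if !defined $min
                || (split /,/, $heads[$i])[1] lt (split /,/, $heads[$min])[1];
        }
        print {$out} $heads[$min];
        $heads[$min] = scalar readline($fhs[$min]);   # refill from that file
    }
    close $out;
}
```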

      Instead of using merge sort alone, a hybrid approach that combines in-memory sorting with merge sort can be used.

      For example:

      Out of the total of n files, sort only m files in memory at a time.

      (m could be approximately n/2; this is just an approximation and can be increased or decreased based on the memory available and the memory threshold the process in question is permitted to use.)

      With the approach of combining merge sort and in-memory sorting:

      1) Both merge sort and the speed of in-memory sorting are exploited.

      2) There is no problem of too many files taking up too much memory, since the number of files sorted in memory is now controlled.
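
      A sketch of the batching idea, assuming each batch of files fits in memory (the batch size and file names are illustrative):

```perl
use strict;
use warnings;

# Sort files in batches of $m: slurp one batch, sort it in memory, and
# write it out as a single pre-sorted "run". The runs can then be merged
# pairwise (or with sort -m) without holding everything in memory at once.
my $m     = 4;                         # illustrative batch size
my @files = glob 'sourcepath/*';
my $run   = 0;

while (my @batch = splice @files, 0, $m) {
    my @lines;
    for my $file (@batch) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        push @lines, <$fh>;
        close $fh;
    }
    # Sort the batch on the second comma-separated field.
    @lines = sort { (split /,/, $a)[1] cmp (split /,/, $b)[1] } @lines;

    open my $out, '>', sprintf('run%03d.txt', $run++)
        or die "Cannot open run file: $!";
    print {$out} @lines;
    close $out;
}
```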

        Instead of using merge sort alone, a hybrid approach that combines in-memory sorting with merge sort can be used.

        That's what the post to which you replied already suggested.

        Out of the total of n files, sort only m files in memory at a time

        A 100MB file takes up a pretty major chunk of memory already. ~~Remember, if the array isn't preallocated to hold enough lines, twice the size of the data is needed.~~

        If I were to re-implement the work in Perl, I'd probably do something equivalent to

        1. cat * | (cd tmp; split --lines=XXX --bytes=YYY - chunk)
          This maximizes use of the available memory while capping peak memory usage.
        2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done
          The sorting would actually be done before writing out the chunk.
        3. Merge file pairs until only one file remains.

        Update: I struck out a statement that's probably wrong. There is overhead, but it should be proportional to the number of lines, not number of bytes.
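
        The three steps above can be sketched in shell like this (the chunk size and demo input are illustrative; sort -m performs the final merge):

```shell
# Demo input: two small files standing in for the OP's large ones.
mkdir -p sourcepath tmp
printf '102,20060925235712445,-1,6,00004563\n101,20060925235712445,-1,1,00000123\n' > sourcepath/a
printf '115,20060925235714565,-1,5,00007893\n108,20060925235712445,-1,1,00456567\n' > sourcepath/b

# 1. Split the concatenated input into fixed-size chunks (2 lines each here).
cat sourcepath/* | (cd tmp && split --lines=2 - chunk)

# 2. Sort each chunk individually, on the first comma-separated field.
for f in tmp/chunk*; do sort -t, -k1,1 "$f" > "$f.sorted"; done

# 3. Merge the pre-sorted chunks; sort -m merges without fully re-sorting.
sort -m -t, -k1,1 tmp/chunk*.sorted > sorted
```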

Re: read and sort multiple files (wheel reuse)
by ikegami (Patriarch) on Dec 01, 2008 at 05:14 UTC
    The following Unix command might do the trick:
    sort -t, -k2 sourcepath/* > sorted
      Since the input is many large files, it's better to use the -T option to specify a temporary directory, instead of filling up the default /tmp directory that sort uses to store its temporary files while sorting.
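
      For example (the directory names and demo data are illustrative):

```shell
# Demo input standing in for the OP's large files.
mkdir -p srcdir bigtmp
printf '115,20060925235714565,-1,5,00007893\n101,20060925235712445,-1,1,00000123\n' > srcdir/a

# -T points sort's temporary spill files at a directory with enough
# free space, instead of the default /tmp.
sort -t, -k2 -T bigtmp srcdir/* > sorted
```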
        You can specify which sort algorithm Perl's sort function uses.
        Please go through the following link to learn more:
        http://search.cpan.org/~tty/kurila-1.14_0/lib/sort.pm

        If you are running on a Unix- or Linux-like operating system, you can use the shell command "sort".
        You can also use the Tie::File module, which will not use much memory, but this will slow down the sorting process.
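
        A sketch of the Tie::File approach (Tie::File is a core module; the file name is illustrative). The file is presented as an array without being slurped up front, but every element access goes through the tie, which is slow, and the sort itself still pulls all elements through memory:

```perl
use strict;
use warnings;
use Tie::File;

# Present merged.txt as an array without reading it all up front.
tie my @lines, 'Tie::File', 'merged.txt'
    or die "Cannot tie merged.txt: $!";

# Sort in place by the second comma-separated field; each element is
# fetched and stored through the tie, so this is slow on big files.
@lines = sort { (split /,/, $a)[1] cmp (split /,/, $b)[1] } @lines;

untie @lines;
```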
Re: read and sort multiple files
by gone2015 (Deacon) on Dec 01, 2008 at 10:25 UTC
Re: read and sort multiple files
by CountZero (Bishop) on Dec 01, 2008 at 07:49 UTC
    If you have a database lying idle nearby, dump the data into the database and extract the data using standard SQL.
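
    A sketch of that approach using DBI with SQLite (requires DBD::SQLite; all table, column, and file names are illustrative):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE records (c1 INT, stamp TEXT, c3 INT, c4 INT, c5 TEXT)');

# Load every source file, one row per line.
my $ins = $dbh->prepare('INSERT INTO records VALUES (?, ?, ?, ?, ?)');
for my $file (glob 'sourcepath/*') {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $ins->execute(split /,/, $line);
    }
    close $fh;
}
$dbh->commit;

# Let the database do the sorting on the chosen column.
my $sth = $dbh->prepare('SELECT * FROM records ORDER BY stamp');
$sth->execute;
open my $out, '>', 'sorted.txt' or die "Cannot open sorted.txt: $!";
while (my $row = $sth->fetchrow_arrayref) {
    print {$out} join(',', @$row), "\n";
}
close $out;
$dbh->disconnect;
```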

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James