in reply to Re^2: read and sort multiple files
in thread read and sort multiple files

Instead of using merge-sort alone ( as a plain approach ) a hybrid way of using in-memory sorting and merge-sort can be combined and used together.

That's what the post to which you replied already suggested.

Out of the total 'n' number of files, sort only 'm' files in memory

A 100MB file takes up pretty major chunk of memory already. Remember, if the array isn't preallocated to hold enough lines, twice the size of the data is needed.

If I were to re-implement the work in Perl, I'd probably do something equivalent to

  1. cat * | (cd tmp; split --lines=XXX -bytes=YYY - chunk)
    This maximizes memory usage while limiting memory usage.
  2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done
    The sorting would actually be done before writing out the chunk.
  3. Merge file pairs until only one file remains.

Update: I struck out a statement that's probably wrong. There is overhead, but it should be proportional to the number of lines, not number of bytes.

Replies are listed 'Best First'.
Re^4: read and sort multiple files
by matrixmadhan (Beadle) on Dec 02, 2008 at 11:38 UTC
    This maximizes memory usage while limiting memory usage.

    Sorry, I don't understand the above comment at all. Would you mind explaining that?

    2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done The sorting would actually be done before writing out the chunk.

    This cannot be guaranteed, what if the file is too big

      Sorry, I don't understand the above comment at all. Would you mind explaining that?

      Let's say you have 100 files to process:

      1. 2,250,000 lines
      2. 500,000 lines
      3. And the rest are all 10,000 lines long

      Let's say you have enough memory to handle 1,000,000 lines. Let's ignore byte size for simplicity, but you'll I placed a limit on that as well in the split example.

      You have two problems:

      1. You have many files to merge.
      2. One of your files is too big.

      The first problem is easy to fix: Just concatenate all the files. Then you're left with one file that's too big.

      1. 3,730,000 lines

      The second problem is easy to fix as well: Just split the file into smaller pieces. Just keep those pieces as big as possible.

      You end up with:

      1. 1,000,000 lines
      2. 1,000,000 lines
      3. 1,000,000 lines
      4. 730,000 lines

      If all you had done was split the large file, you would have 101 files to sort and merge, resulting in 100 merges. By concatenating first, all you had to do is 4 (long) merges. It cuts down on overhead a little.

      This cannot be guaranteed, what if the file is too big

      Can't be. It sorts the split chunks, not the original files.

        Yes, that's right.

        Sorting on the split chunks isn't free as well as there is additional 'n-1' open and close involved with the extra 'n-1' files when working with split chunks of a file