The merge requires all the sorted intermediate files to be open at once: if you have a lot of files, this may run out of filehandles. You can:

If the original files are small compared to the amount you can sort, reading and sorting several together in the first phase reduces the scale of the merging phase(s). But a good, general solution would allow for an arbitrarily large number of files in the merge.

The merge itself requires you to maintain a list of the current record for all the files to be merged. Each time you output a record, you read from its file and decide which is the next record to be output. In general, there are a number of ways to do this:

In Perl, the trivial way to do this is to simply replace the record you just wrote with the next from its respective file, and sort the list -- you probably have to have a very large number of files to merge before it's worth using a heap (unless there's a good heap module somewhere -- see heap in CPAN).

If you know something about the keys, you could partition the original files, sort the partitions and concatenate. You have a trade-off here between the complexity of many smaller partitions (easier to sort, but harder to partition) and few larger partitions (harder to sort if exceed capacity of in memory sort). The target is to be able to sort each partition in memory, if that requires more partitions than filehandles, then you're back to doing your own buffering:

  1. set up 'n', empty partition buffers and 'n' empty parition files.

  2. read each input file in turn, allocate each record to its partition.

  3. when a partition buffer fills, append it to its respective partition file.

  4. once all input files are done, read each partition (in key order !), sort it and append to the output.

The attraction of this is that the partition phase is reasonably straightforward and the sorting step can be reduced to almost any size by increasing the number of partitions. The downside is that if your "guess" about how to partition is wrong, you can end up at step (4) with one or more partitions which are too big to sort directly. Now you have a choice: either repartition (which has the advantage of using the same code, but the disadvantage of having to know more about the keys) or split, sort and merge (which has the disadvantage of being yet more code).

The sorting of large files is an interesting problem in itself. What it comes down to is that you can only sort what you can fit in memory; after that, its all merging. Everything else is one or other way of dividing the problem down. In general, you can speed up the sorting by having an auxiliary array of pointers to the records, and sorting that array -- this reduces the amount of copying. Happily, Perl arrays are implicitly arrays of pointers to the value of entry -- so you don't need to worry about that.

I still like Knuth for the fundamentals... don't dismiss the discussion of tape sorting entirely, though you might want to think "file handle" instead of "tape". You may also see discussion of drum storage... don't worry about that.


In reply to Re^2: how to read multiple file and sort the lines by gone2015
in thread how to read multiple file and sort the lines by newPerlr

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.