The merge requires all the sorted intermediate files to be open at once: if you have a lot of files, you may run out of filehandles. You can either:
- generalise the code to merge sub-sets of the sorted files into a new, smaller set of larger sorted files, and repeat. This is relatively straightforward, but multiplies the number of times you process each record.
- do your own buffering for each input to the merge, and open/close/seek files as required. For all I know, there's a module to do this (see CPAN).
If the original files are small compared to the amount you can sort, reading and sorting several together in the first phase reduces the scale of the merging phase(s). But a good, general solution would allow for an arbitrarily large number of files in the merge.
The merge itself requires you to maintain a list of the current record for all the files to be merged. Each time you output a record, you read from its file and decide which is the next record to be output. In general, there are a number of ways to do this:
- scan the list looking for the smallest (assuming an ascending sort) -- that's O(n) in the number of files to be merged.
- keep the list in order, and insert each new record in place -- that's about n/2 comparisons on average.
- if 'n' is large, I'd use a 'heap' -- which is O(log2 n). (The heap is a beautiful structure. Once you have your head around it, it's not a lot of code either.)
In Perl, the trivial way to do this is to simply replace the record you just wrote with the next from its respective file, and sort the list -- you probably have to have a very large number of files to merge before it's worth using a heap (unless there's a good heap module somewhere -- see heap in CPAN).
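A minimal sketch of that "replace and re-sort" merge, assuming one sorted record per line in each input file and taking the file names from the command line (the names and usage here are illustrative, not a finished tool):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = @ARGV;      # already-sorted input files
    my (@fh, @current);     # filehandles, and [record, file-index] pairs

    # open each sorted file and prime the list with its first record
    for my $i (0 .. $#files) {
        open $fh[$i], '<', $files[$i] or die "can't open $files[$i]: $!";
        my $rec = readline $fh[$i];
        push @current, [ $rec, $i ] if defined $rec;
    }

    while (@current) {
        # re-sort the (small) list of current records and output the smallest
        @current = sort { $a->[0] cmp $b->[0] } @current;
        my ($rec, $i) = @{ shift @current };
        print $rec;

        # replace it with the next record from the same file, if there is one
        my $next = readline $fh[$i];
        push @current, [ $next, $i ] if defined $next;
    }

Each output record costs a sort of a list with at most one entry per input file, which is fine until the number of files gets large -- at which point the heap starts to pay off.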
If you know something about the keys, you could partition the original files, sort the partitions and concatenate. You have a trade-off here between the complexity of many smaller partitions (easier to sort, but harder to partition) and fewer, larger partitions (harder to sort if they exceed the capacity of an in-memory sort). The target is to be able to sort each partition in memory; if that requires more partitions than filehandles, then you're back to doing your own buffering:
1. set up 'n' empty partition buffers and 'n' empty partition files.
2. read each input file in turn, allocating each record to its partition.
3. when a partition buffer fills, append it to its respective partition file.
4. once all the input files are done, read each partition (in key order!), sort it and append it to the output.
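A sketch of steps 1 to 4, assuming the key is the whole line and every key starts with a lower-case letter, so we can partition on the first character -- the partition rule, the buffer limit and the file names (part_*.tmp, sorted.out) are assumptions for illustration only:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $buf_limit  = 10_000;            # records held per partition buffer
    my @partitions = ('a' .. 'z');      # assumed key ranges, one per first letter
    my %buffer;

    unlink glob 'part_*.tmp';           # start with empty partition files

    # append a partition buffer to its partition file and empty it
    sub flush_partition {
        my ($p) = @_;
        return unless $buffer{$p} && @{ $buffer{$p} };
        open my $out, '>>', "part_$p.tmp" or die "can't append part_$p.tmp: $!";
        print {$out} @{ $buffer{$p} };
        close $out;
        @{ $buffer{$p} } = ();
    }

    # steps 1-3: read the inputs, allocate each record to its partition,
    # flushing a buffer to its partition file whenever it fills
    for my $file (@ARGV) {
        open my $in, '<', $file or die "can't open $file: $!";
        while (my $rec = <$in>) {
            my $p = lc substr $rec, 0, 1;
            die "unexpected key in $file: $rec" unless $p =~ /^[a-z]/;
            push @{ $buffer{$p} }, $rec;
            flush_partition($p) if @{ $buffer{$p} } >= $buf_limit;
        }
        close $in;
    }
    flush_partition($_) for @partitions;

    # step 4: read each partition in key order, sort it in memory, append it
    open my $out, '>', 'sorted.out' or die "can't write sorted.out: $!";
    for my $p (@partitions) {
        next unless -e "part_$p.tmp";
        open my $in, '<', "part_$p.tmp" or die "can't open part_$p.tmp: $!";
        print {$out} sort <$in>;
        close $in;
    }
    close $out;

If any part_*.tmp file turns out to be too big to slurp and sort, that is exactly the "wrong guess" case discussed next.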
The attraction of this is that the partition phase is reasonably straightforward and the sorting step can be reduced to almost any size by increasing the number of partitions. The downside is that if your "guess" about how to partition is wrong, you can end up at step (4) with one or more partitions which are too big to sort directly. Now you have a choice: either repartition (which has the advantage of using the same code, but the disadvantage of having to know more about the keys) or split, sort and merge (which has the disadvantage of being yet more code).
The sorting of large files is an interesting problem in itself. What it comes down to is that you can only sort what you can fit in memory; after that, it's all merging. Everything else is one way or another of dividing the problem down. In general, you can speed up the sorting by having an auxiliary array of pointers to the records, and sorting that array -- this reduces the amount of copying. Happily, Perl arrays are implicitly arrays of pointers to the values of their entries -- so you don't need to worry about that.
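For illustration only, this is what sorting an auxiliary array of indices looks like in Perl; as noted above, a plain sort @records already does the equivalent internally, because the array entries are pointers to the values:

    my @records = ( "pear\n", "apple\n", "orange\n" );
    # sort an auxiliary array of indices, comparing the records they point to,
    # rather than shuffling the records themselves
    my @order = sort { $records[$a] cmp $records[$b] } 0 .. $#records;
    print @records[@order];        # apple, orange, pear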
I still like Knuth for the fundamentals... don't dismiss the discussion of tape sorting entirely, though you might want to think "file handle" instead of "tape". You may also see discussion of drum storage... don't worry about that.