newPerlr has asked for the wisdom of the Perl Monks concerning the following question:
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: how to read multiple file and sort the lines
by GrandFather (Saint) on Nov 26, 2008 at 04:48 UTC | |
There are two parts to the puzzle - how to sort a large file, and how to merge multiple large sorted files. Super Search will help with the first part. The second part is achieved by reading a record from each file then outputting and replacing the smallest record in the current set until all records have been processed. Something like:
Prints:
Perl reduces RSI - it saves typing | [reply] [d/l] [select] |
by gone2015 (Deacon) on Nov 26, 2008 at 12:55 UTC | |
The merge requires all the sorted intermediate files to be open at once: if you have a lot of files, this may run out of filehandles. You can: If the original files are small compared to the amount you can sort, reading and sorting several together in the first phase reduces the scale of the merging phase(s). But a good, general solution would allow for an arbitrarily large number of files in the merge. The merge itself requires you to maintain a list of the current record for all the files to be merged. Each time you output a record, you read from its file and decide which is the next record to be output. In general, there are a number of ways to do this: In Perl, the trivial way to do this is to simply replace the record you just wrote with the next from its respective file, and sort the list -- you probably have to have a very large number of files to merge before it's worth using a heap (unless there's a good heap module somewhere -- see heap in CPAN). If you know something about the keys, you could partition the original files, sort the partitions and concatenate. You have a trade-off here between the complexity of many smaller partitions (easier to sort, but harder to partition) and few larger partitions (harder to sort if exceed capacity of in memory sort). The target is to be able to sort each partition in memory, if that requires more partitions than filehandles, then you're back to doing your own buffering: The attraction of this is that the partition phase is reasonably straightforward and the sorting step can be reduced to almost any size by increasing the number of partitions. The downside is that if your "guess" about how to partition is wrong, you can end up at step (4) with one or more partitions which are too big to sort directly. Now you have a choice: either repartition (which has the advantage of using the same code, but the disadvantage of having to know more about the keys) or split, sort and merge (which has the disadvantage of being yet more code). The sorting of large files is an interesting problem in itself. What it comes down to is that you can only sort what you can fit in memory; after that, its all merging. Everything else is one or other way of dividing the problem down. In general, you can speed up the sorting by having an auxiliary array of pointers to the records, and sorting that array -- this reduces the amount of copying. Happily, Perl arrays are implicitly arrays of pointers to the value of entry -- so you don't need to worry about that. I still like Knuth for the fundamentals... don't dismiss the discussion of tape sorting entirely, though you might want to think "file handle" instead of "tape". You may also see discussion of drum storage... don't worry about that. | [reply] |
|
Re: how to read multiple file and sort the lines
by graff (Chancellor) on Nov 26, 2008 at 06:05 UTC | |
That basic command line does the simplest kind of sorting and is very fast and efficient for most typical situations, but there are plenty of option flags you can add in to do the sorting in a variety of ways, and with a variety of settings for optimizing to a particular usage. (update: Welcome to the Monastery, BTW... In case you hadn't noticed from previous visits, monks usually don't provide newbies with code to fulfill homework assignments -- the first reply above was an exception to the rule. The preference here is that you show us some code that you've tried, with some explanation of what you want to achieve, so that we can help you learn to solve the problem for yourself. You don't learn very much when other people do your homework for you.) | [reply] [d/l] |