newPerlr has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl, so please can anyone help me. I have to create a script for an application that takes in multiple files, sorts the data into order, and writes it to one file. It has to be efficient and fast. I wrote a script that reads a file and sorts it, but I'm putting everything into an array, the files are around 50MB each, and it's taking up all the memory. Can anyone tell me how I could do this, or if someone can give me code for it I would be really grateful. Thank you in advance.

Re: how to read multiple file and sort the lines
by GrandFather (Saint) on Nov 26, 2008 at 04:48 UTC

    There are two parts to the puzzle - how to sort a large file, and how to merge multiple large sorted files. Super Search will help with the first part. The second part is achieved by reading a record from each file, then repeatedly outputting the smallest record in the current set and replacing it with the next record from its file, until all records have been processed. Something like:

    use strict;
    use warnings;

    my $data1 = <<DATA;
    1
    3
    7
    10
    29
    DATA
    my $data2 = <<DATA;
    2
    4
    11
    20
    22
    24
    30
    DATA
    my $data3 = <<DATA;
    90
    DATA

    my @files;

    # Open files and add them to the list
    for my $inFile (\$data1, \$data2, \$data3) {
        push @files, [];
        open $files[-1][0], '<', $inFile;
        my $fh = $files[-1][0];
        $files[-1][1] = <$fh>;
        if (!defined $files[-1][1]) {
            close $files[-1][0];
            pop @files;
            next;
        }
        chomp $files[-1][1];
    }

    while (@files) {
        my $minRecIndex = 0;

        # Find the current minimum record
        for my $cmpRecIndex (0 .. $#files) {
            $minRecIndex = $cmpRecIndex
                if $files[$cmpRecIndex][1] < $files[$minRecIndex][1];
        }

        print "$files[$minRecIndex][1]\n";

        my $fh = $files[$minRecIndex][0];
        $files[$minRecIndex][1] = <$fh>;
        if (defined $files[$minRecIndex][1]) {
            chomp $files[$minRecIndex][1];
        }
        else {
            close $files[$minRecIndex][0];
            splice @files, $minRecIndex, 1;
        }
    }

    Prints:

    1
    2
    3
    4
    7
    10
    11
    20
    22
    24
    29
    30
    90

    Perl reduces RSI - it saves typing

      The merge requires all the sorted intermediate files to be open at once: if you have a lot of files, you may run out of filehandles. You can:

      • generalise the code to merge sub-sets of the sorted files into new, smaller sets of larger sorted files, and repeat (see the sketch after this list). This is relatively straightforward, but multiplies the number of times you process each record.

      • do your own buffering for each input to the merge, and open/close/seek files as required. For all I know, there's a module to do this (see CPAN).
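
      For the first bullet, here is a minimal sketch of a multi-pass merge, assuming an illustrative fan-in of 10 files per pass and a merge_files() routine that behaves like the "output and replace the smallest record" loop in GrandFather's reply above (the routine is left as a stub here, and all names are hypothetical):

      use strict;
      use warnings;
      use File::Temp qw(tempfile);

      # Hypothetical stub: a real merge_files() would contain the k-way
      # merge loop shown in the reply above.
      sub merge_files {
          my ($outFh, @paths) = @_;
          die "merge_files() is a stub - supply a real k-way merge";
      }

      my $fanIn = 10;        # illustrative: sorted files to merge per pass
      my @runs  = @ARGV;     # paths of the sorted intermediate files

      # Merge groups of $fanIn runs into larger runs until one file remains.
      while (@runs > 1) {
          my @nextRuns;
          while (my @group = splice @runs, 0, $fanIn) {
              my ($outFh, $outName) = tempfile(UNLINK => 0);
              merge_files($outFh, @group);
              close $outFh;
              push @nextRuns, $outName;
          }
          @runs = @nextRuns;
      }
      print "final sorted file: $runs[0]\n" if @runs;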

      If the original files are small compared to the amount you can sort, reading and sorting several together in the first phase reduces the scale of the merging phase(s). But a good, general solution would allow for an arbitrarily large number of files in the merge.

      The merge itself requires you to maintain a list of the current record for all the files to be merged. Each time you output a record, you read from its file and decide which is the next record to be output. In general, there are a number of ways to do this:

      • scan the list looking for the smallest (assuming ascending order sort) -- that's O(n) in the number of files to be merged.

      • keep the list in order, and insert each new one in place -- that's O(n/2).

      • if 'n' is large, I'd use a 'heap' -- which is O(log2 n). (The heap is a beautiful structure. Once you have your head around it, not a lot of code either.)

      In Perl, the trivial way to do this is to simply replace the record you just wrote with the next from its respective file, and re-sort the list -- you probably have to have a very large number of files to merge before it's worth using a heap (unless there's a good heap module somewhere -- see heap in CPAN).
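
      A minimal sketch of that trivial approach, assuming numeric records, one per line, with the sorted input files named on the command line (the [$fh, $record] layout mirrors GrandFather's example; no heap module is used):

      use strict;
      use warnings;

      # @files holds [$fh, $current_record] pairs, one per sorted input file.
      my @files;
      for my $path (@ARGV) {
          open my $fh, '<', $path or die "open $path: $!";
          my $rec = <$fh>;
          next unless defined $rec;     # skip empty files
          chomp $rec;
          push @files, [ $fh, $rec ];
      }

      while (@files) {
          # Keep the list ordered by current record; the head is the minimum.
          @files = sort { $a->[1] <=> $b->[1] } @files;
          print "$files[0][1]\n";

          # Replace the record just emitted with the next one from its file.
          my $fh   = $files[0][0];
          my $next = <$fh>;
          if (defined $next) {
              chomp $next;
              $files[0][1] = $next;
          }
          else {
              close $fh;
              shift @files;             # this input is exhausted
          }
      }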

      If you know something about the keys, you could partition the original files, sort the partitions and concatenate. You have a trade-off here between the complexity of many smaller partitions (easier to sort, but harder to partition) and fewer, larger partitions (harder to sort if they exceed the capacity of the in-memory sort). The target is to be able to sort each partition in memory; if that requires more partitions than filehandles, then you're back to doing your own buffering:

      1. set up 'n' empty partition buffers and 'n' empty partition files.

      2. read each input file in turn, allocate each record to its partition.

      3. when a partition buffer fills, append it to its respective partition file.

      4. once all input files are done, read each partition (in key order!), sort it and append to the output.

      The attraction of this is that the partition phase is reasonably straightforward and the sorting step can be reduced to almost any size by increasing the number of partitions. The downside is that if your "guess" about how to partition is wrong, you can end up at step (4) with one or more partitions which are too big to sort directly. Now you have a choice: either repartition (which has the advantage of using the same code, but the disadvantage of having to know more about the keys) or split, sort and merge (which has the disadvantage of being yet more code).
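
      A minimal sketch of steps 2-4, assuming numeric one-per-line keys, an illustrative bucket width of 1000 keys per partition, and a flush threshold of 10,000 buffered records (those numbers and the partition_*.tmp / sorted_output.txt names are assumptions, not from the post):

      use strict;
      use warnings;

      my $width    = 1000;      # keys per partition (assumed key distribution)
      my $flush_at = 10_000;    # records to buffer before appending to disk
      my %buffer;               # partition number => array ref of records

      sub flush_partition {
          my ($part) = @_;
          return unless $buffer{$part} && @{ $buffer{$part} };
          # Open, append, close - so we never hold many filehandles at once.
          open my $fh, '>>', "partition_$part.tmp"
              or die "append partition_$part.tmp: $!";
          print {$fh} @{ $buffer{$part} };
          close $fh;
          @{ $buffer{$part} } = ();
      }

      # Steps 2-3: route each record to its partition buffer, flushing full buffers.
      for my $path (@ARGV) {
          open my $in, '<', $path or die "open $path: $!";
          while (my $rec = <$in>) {
              my ($key) = $rec =~ /^(\d+)/ or next;
              my $part = int($key / $width);
              push @{ $buffer{$part} }, $rec;
              flush_partition($part) if @{ $buffer{$part} } >= $flush_at;
          }
          close $in;
      }
      flush_partition($_) for keys %buffer;

      # Step 4: read each partition in key order, sort it in memory, append to output.
      open my $out, '>', 'sorted_output.txt' or die "open output: $!";
      for my $part (sort { $a <=> $b } keys %buffer) {
          open my $in, '<', "partition_$part.tmp"
              or die "read partition_$part.tmp: $!";
          my @recs = <$in>;
          close $in;
          print {$out} sort { $a <=> $b } @recs;
      }
      close $out;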

      The sorting of large files is an interesting problem in itself. What it comes down to is that you can only sort what you can fit in memory; after that, it's all merging. Everything else is one way or another of dividing the problem down. In general, you can speed up the sorting by having an auxiliary array of pointers to the records, and sorting that array -- this reduces the amount of copying. Happily, Perl arrays are implicitly arrays of pointers to their entries' values -- so you don't need to worry about that.
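
      A minimal sketch of that first, in-memory phase, writing sorted runs that a merge like the one above can combine later; the numeric sort, the 500,000-line run size, and the run_*.tmp names are illustrative assumptions:

      use strict;
      use warnings;

      my $run_size = 500_000;   # lines per run; assumed to fit in memory
      my @run_files;
      my @chunk;

      sub write_run {
          return unless @chunk;
          my $name = sprintf 'run_%03d.tmp', scalar @run_files;
          open my $fh, '>', $name or die "open $name: $!";
          # Perl sorts the array of pointers to the lines, so the lines
          # themselves are not copied around during the sort.
          print {$fh} sort { $a <=> $b } @chunk;
          close $fh;
          push @run_files, $name;
          @chunk = ();
      }

      while (my $line = <>) {   # reads every file named on the command line
          push @chunk, $line;
          write_run() if @chunk >= $run_size;
      }
      write_run();              # flush the final, partial run

      print "sorted runs to merge: @run_files\n";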

      I still like Knuth for the fundamentals... don't dismiss the discussion of tape sorting entirely, though you might want to think "file handle" instead of "tape". You may also see discussion of drum storage... don't worry about that.

Re: how to read multiple file and sort the lines
by graff (Chancellor) on Nov 26, 2008 at 06:05 UTC
    I assume you are talking about a homework assignment here. Why else would you have to write a script for something that was solved decades ago on unix systems? Especially when you consider that the unix (now gnu) solution has been ported to (and is currently available on) all popular operating systems?
    sort multiple*files > single-sorted.file
    That basic command line does the simplest kind of sorting and is very fast and efficient for most typical situations, but there are plenty of option flags you can add to do the sorting in a variety of ways, with settings for optimizing it to a particular usage.

    (update: Welcome to the Monastery, BTW... In case you hadn't noticed from previous visits, monks usually don't provide newbies with code to fulfill homework assignments -- the first reply above was an exception to the rule. The preference here is that you show us some code that you've tried, with some explanation of what you want to achieve, so that we can help you learn to solve the problem for yourself. You don't learn very much when other people do your homework for you.)