Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of large records. Each record spans multiple lines:
id:999999
Name:John Doe
StreetAddr1:44 Someplace Dr
City:Someplace
State:CA
Zip:99999
MxId:33
Routing:b7
----------
id:789654
Name:Someone Else
StreetAddr1:33 Someotherplace Dr
etc...
Each record has more fieldname/fieldvalue pairs; I removed them for the purpose of this example.

My task is to keep the file format the same but sort the records by Zip. I must admit I'm at a loss as to how to begin. Since each record is rather large, and the entire file could also be rather large, I'm thinking I don't want to slurp the whole file into an array or a hash and then somehow sort that. Any pointers (I'm not asking you to do the work for me, and I'm not asking for code), be they links to modules or a simple explanation of how to begin, would be most helpful.

Replies are listed 'Best First'.
Re: Sorting A File Of Large Records
by dws (Chancellor) on Dec 10, 2002 at 19:20 UTC
    I'm thinking I don't want to slurp the whole file into an array or a hash and then somehow sort that.

    If you don't want to pull the whole file into memory, you have several alternatives. Here are two:

    1. Pull pieces of the file into memory, sort them, and write them to temporary files. Then merge the results into a final, sorted file.

    2. Scan the file once, remembering the seek offset of the beginning of each record, and the key you want to sort on. Sort the {key, offset} pairs, and then use this sorted list to seek/read records, emitting them in sorted order into a new file.

    If you have enough memory to deal with the {key, offset} pairs, I'd go that way. It's easier to code. The descriptions of tell() and seek() in perlfunc tell you what you need.
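
    Here's a rough, untested sketch of the index-building pass for option 2 (the file name and the ten-dash separator line are assumptions based on your sample data):

    # Sketch: build a {zip, offset} index in one pass using tell().
    local $/ = "----------\n";          # read one record at a time

    open FILE, '<', 'records.txt' or die "Can't open records.txt: $!";

    my @index;
    my $offset = tell(FILE);            # position *before* reading each record
    while (my $record = <FILE>) {
        my ($zip) = $record =~ /Zip:(\d{5})/;
        push @index, [ $zip, $offset ];
        $offset = tell(FILE);
    }

    # Each sorted entry's offset is then a seek() target for emitting
    # the records into the new file in zip order.
    my @sorted = sort { $a->[0] <=> $b->[0] } @index;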

      Thanks for idea #2. So I've got my {key, offset} pairs nicely sorted - no problem there. I can also seek to each offset in the unsorted file without a problem. My problem is this: once I'm in the correct starting position for my next record in the unsorted file (i.e. after calling seek), how can I extract *just* the next record and not the rest of the file from that point on:
      for my $zip (sort { $a <=> $b } (keys(%zips))) {
          seek FILE, $zips{$zip}, 0;
          print NEW <FILE>;
      }
      Obviously, <FILE> contains the rest of the file following the offset (i.e. $zips{$zip}) and not just the next record. Any ideas as to what I'm doing wrong?
        My problem is this: once I'm in the correct starting position for my next record in the unsorted file (i.e. after calling seek), how can I extract *just* the next record and not the rest of the file from that point on?

        Read the file line-by-line (i.e., using <FILE> in scalar context) until you've read the complete record.
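
        For example - a minimal, untested sketch using the handles from your code, with $/ left at its newline default and assuming the separator is a line of exactly ten dashes:

        seek FILE, $zips{$zip}, 0;
        while (my $line = <FILE>) {
            print NEW $line;
            last if $line =~ /^-{10}$/;   # stop once the separator line is copied
        }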

Re: Sorting A File Of Large Records
by dingus (Friar) on Dec 10, 2002 at 20:01 UTC
    This could be a good place to use An APL trick for the Schwartzian Transform, as that is almost exactly the reason I thought it up in the first place...

    There are some subtleties, mostly covered in the replies above, that help. The first trick (suggested by dmitri) is to set $/ to '----------'.$/ so that you read one record at a time.

    If the file is far too big to fit into memory, the second trick is to sort based on each record's location within the file and its zip code - use tell() on each record.

    Other tricks may depend on whether the file is local or not (whether you can afford to read it multiple times), whether you want to sort on a secondary key as well, and so on.

    Actually, thinking about it: assuming that you have sufficient memory and no secondary key you wish to sort on, my preferred solution would be a two-phase sort. Phase one is an insertion sort into a hash of zip codes. Then you read the file again and write out the sorted version.

    local $/ = '----------'.$/;
    my %zips;
    open (FILE, "<filename") or die "error $!";
    my $teller = tell(FILE);    # NB: need the position before the record is read!
    while (<FILE>) {
        die "no zip code found in record at $teller\n" unless (/Zip:(\d{5})/s);
        push @{$zips{$1}}, $teller;
        $teller = tell(FILE);   # FILE is optional here but a good idea!
    }
    open (SORTED, ">newfile") or die "Can't open newfile: $!";
    for (sort keys %zips) {
        for (@{$zips{$_}}) {
            seek FILE, $_, 0;
            print SORTED <FILE>;
        }
    }
    close FILE;
    close SORTED;
    Disclaimer - code untested may contain horrible bugs

    Dingus


    Enter any 47-digit prime number to continue.
      Nicely done dingus. I'd offer one tweak. I assume that the OP wanted to print just the current record to the "sorted" file and not the whole file from the seek offset:
      for my $zipCode (sort { $a <=> $b } (keys(%zips))) {
          for my $offset (@{$zips{$zipCode}}) {
              seek FILE, $offset, 0;
              while (<FILE>) {
                  last unless (/Zip:$zipCode/);
                  print SORTED $_;
              }
          }
      }
      -- vek --
Re: Sorting A File Of Large Records
by pfaut (Priest) on Dec 10, 2002 at 19:21 UTC
    If the file is too large to sort in memory, maybe you can use perl to convert it to a format that would lend itself to a normal file sort. Write a program to convert the multiline format to a single line format and another program to convert it back. You might even consider writing these programs as filters so the data would just pass through instead of having to create temporary intermediate files.
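
    For instance - an untested sketch of both filters, assuming a ten-dash separator line and that the control character \x01 never occurs in the data:

    # flatten.pl - one record per line, zip code in front as the sort key
    local $/ = "----------\n";          # one record per read
    while (my $rec = <STDIN>) {
        chomp $rec;                     # strip the trailing separator line
        my ($zip) = $rec =~ /Zip:(\d{5})/;
        $rec =~ s/\n/\x01/g;            # \x01 stands in for the newlines
        print "$zip\x01$rec\n";
    }

    # unflatten.pl - reverse the transformation after sorting
    while (my $line = <STDIN>) {
        chomp $line;
        $line =~ s/^\d{5}\x01//;        # drop the leading sort key
        $line =~ s/\x01/\n/g;           # restore the original line breaks
        print $line, "----------\n";    # $line already ends with a newline
    }

    The pipeline would then be: perl flatten.pl < records.txt | sort | perl unflatten.pl > sorted.txt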
      You could also set $/ to "--------\n" (or something like that), if the separating lines are the same.
Re: Sorting A File Of Large Records
by CountZero (Bishop) on Dec 10, 2002 at 19:45 UTC

    If you have a database handy somewhere, parse the file for the fieldnames and values, dump them to the database and extract them again with SQL (or similar) ordered in any way you like.

    When management sees what a nice job you have done and asks you next to sort it according to another field, all you have to do is write a new SQL query and the job is done.
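
    For instance - a rough, untested sketch using DBI with DBD::SQLite (the file, database, and table names are all invented for the example):

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE recs (zip TEXT, body TEXT)');

    local $/ = "----------\n";          # assumes a ten-dash separator line
    open my $in, '<', 'records.txt' or die "Can't open records.txt: $!";
    my $ins = $dbh->prepare('INSERT INTO recs (zip, body) VALUES (?, ?)');
    while (my $rec = <$in>) {
        my ($zip) = $rec =~ /Zip:(\d{5})/;
        $ins->execute($zip, $rec);      # store the whole record verbatim
    }
    close $in;

    # Re-sorting on another field later is just a different ORDER BY.
    open my $out, '>', 'sorted.txt' or die "Can't open sorted.txt: $!";
    my $rows = $dbh->selectcol_arrayref('SELECT body FROM recs ORDER BY zip');
    print $out @$rows;
    close $out;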

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Sorting A File Of Large Records
by grantm (Parson) on Dec 10, 2002 at 21:54 UTC

    The XML::Filter::Sort module allows sorting arbitrarily large amounts of data in XML format. You could use it like this:

    1. convert your data to XML (eg: <rec id="999999" Name="John Doe" ... /> ...)
    2. parse the XML file with a SAX parser that passes events to ...
    3. XML::Filter::Sort which sorts the events and passes them to ...
    4. XML::SAX::Writer which outputs an XML file
    5. convert the XML back to your original format

    However, since XML::Filter::Sort is a SAX filter (rather than an XML filter), it's not actually necessary to use XML data at any stage of the game. Instead you could do it like this:

    1. write a simple SAX generator class (inherit from XML::SAX::Base) which reads the text file and generates SAX events to ...
    2. XML::Filter::Sort which sorts the events and passes them to ...
    3. a simple SAX handler class which accepts events and writes out plain text in the original format

    Although you didn't list maintainability as a key requirement, this type of arrangement would separate your code into logical units which each implement a well defined subset of the whole operation.
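
    For the all-XML arrangement, the middle of the pipeline (steps 2 to 4) might be wired up like this - a minimal, untested sketch with invented file names, assuming the records have been converted to <rec> elements with a <Zip> child element (I've used a child element rather than an attribute to keep the sort key spec simple):

    use XML::SAX::ParserFactory;
    use XML::Filter::Sort;
    use XML::SAX::Writer;

    # Events flow: parser -> sorter -> writer.
    my $writer = XML::SAX::Writer->new(Output => 'sorted.xml');
    my $sorter = XML::Filter::Sort->new(
        Record  => 'rec',               # the element that delimits one record
        Keys    => 'Zip',               # sort on the content of the <Zip> child
        Handler => $writer,
    );
    my $parser = XML::SAX::ParserFactory->parser(Handler => $sorter);
    $parser->parse_uri('records.xml');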

Re: Sorting A File Of Large Records
by metlhed_ (Beadle) on Dec 10, 2002 at 19:30 UTC

    Since the file is large you could try something like this:

    1. go through the whole file getting only the zip codes
    2. sort the zip codes in the order you need
    3. loop through the zips; for each zip, go through the file and output the matching records.

    This is very inefficient but, if your file is too large to fit into memory and time isn't a big deal, it should work.

      Edit: I just realised I described a very poor method, similar to dws's 2nd suggestion. I would go with dws's 2nd suggestion.