in reply to Sorting A File Of Large Records

I'm thinking I don't want to slurp the whole file into an array or a hash and then somehow sort that.

If you don't want to pull the whole file into memory, you have several alternatives. Here are two:

  1. Pull pieces of the file into memory, sort them, and write them to temporary files. Then merge the results into a final, sorted file.

  2. Scan the file once, remembering the seek offset of the beginning of each record, and the key you want to sort on. Sort the {key, offset} pairs, and then use this sorted list to seek/read records, emitting them in sorted order into a new file.
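Approach 1 (sort chunks, then merge) can be sketched roughly as follows. The file names, the tiny chunk size, and the one-record-per-line assumption are all invented for the example; a real script would use a much larger chunk size and parse your actual record format:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented demo input: one record per line.
open my $demo, '>', 'big.dat' or die "big.dat: $!";
print $demo "$_\n" for qw(pear apple plum fig banana cherry date);
close $demo;

my $chunk_size = 3;            # tiny for the demo; use something large
my @temp_names;

# Phase 1: sort each chunk in memory, write it to a temp file.
open my $in, '<', 'big.dat' or die "big.dat: $!";
while (1) {
    my @chunk;
    while (@chunk < $chunk_size and defined(my $line = <$in>)) {
        push @chunk, $line;
    }
    last unless @chunk;
    my ($tmp, $name) = tempfile(UNLINK => 0);
    print $tmp sort @chunk;    # default string sort; adapt as needed
    close $tmp;
    push @temp_names, $name;
}
close $in;

# Phase 2: N-way merge -- repeatedly emit the smallest pending line.
my @fhs   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @temp_names;
my @heads = map { scalar <$_> } @fhs;
open my $out, '>', 'merged.dat' or die "merged.dat: $!";
while (grep { defined } @heads) {
    my $min;
    for my $i (0 .. $#heads) {
        next unless defined $heads[$i];
        $min = $i if !defined $min or $heads[$i] lt $heads[$min];
    }
    print $out $heads[$min];
    $heads[$min] = scalar readline($fhs[$min]);
}
close $out;
unlink @temp_names;
```

The linear scan for the minimum is fine for a handful of temp files; with many chunks you'd want a heap (or just shell out to sort(1), which does all of this for you).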

If you have enough memory to deal with the {key, offset} pairs, I'd go that way. It's easier to code. The descriptions of tell() and seek() in perlfunc tell you what you need.
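A rough sketch of the {key, offset} approach, assuming newline-terminated records whose sort key is the first whitespace-separated field; the file names and sample data are made up for the example:

```perl
use strict;
use warnings;

# Invented demo input: one record per line, key is the first field.
open my $demo, '>', 'records.dat' or die "records.dat: $!";
print $demo "$_\n" for '30341 Chamblee', '10001 Manhattan', '94105 SoMa';
close $demo;

# Pass 1: remember each record's key and its byte offset (via tell).
my @index;                       # list of [key, offset] pairs
open my $in, '<', 'records.dat' or die "records.dat: $!";
for (my $off = tell $in; defined(my $line = <$in>); $off = tell $in) {
    my ($key) = split ' ', $line;
    push @index, [$key, $off];
}

# Pass 2: seek to each offset in key order, copy exactly one record.
open my $out, '>', 'records.sorted' or die "records.sorted: $!";
for my $pair (sort { $a->[0] <=> $b->[0] } @index) {
    seek $in, $pair->[1], 0 or die "seek: $!";
    print $out scalar <$in>;     # scalar context: one line, not the rest
}
close $_ for $in, $out;
```

The scalar context on the readline is the important detail: in list context <$in> would slurp everything from the offset to end of file.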

Replies are listed 'Best First'.
Re: Re: Sorting A File Of Large Records
by Anonymous Monk on Dec 10, 2002 at 21:48 UTC
    Thanks for idea #2. So I've got my {key, offset} pairs nicely sorted - no problem there. I can also seek to each offset in the unsorted file without a problem. My problem is this: once I'm at the correct starting position for my next record in the unsorted file (i.e., after calling seek), how can I extract *just* the next record, and not the rest of the file from that point on?
    for my $zip (sort { $a <=> $b } keys %zips) {
        seek FILE, $zips{$zip}, 0;
        print NEW <FILE>;
    }
    Obviously, <FILE> here contains the rest of the file following the offset (i.e., $zips{$zip}) and not just the next record. Any ideas as to what I'm doing wrong?
      My problem is this: once I'm in the correct starting position for my next record in the unsorted file (i.e after calling seek), how can I extract *just* the next record and not the rest of the file from that point on?

      Read the file line-by-line (i.e., using <FILE> in scalar context) until you've read the complete record.
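For the common case of one record per line, the fix is just to read in scalar context after each seek. Here is a self-contained sketch built around the loop above; the file names and data are invented, and if your records span several lines you'd keep reading in scalar context until you hit your record terminator:

```perl
use strict;
use warnings;

# Invented demo input: one record per line, zip code first.
open my $mk, '>', 'unsorted.dat' or die "unsorted.dat: $!";
print $mk "$_\n" for '30000 c', '10000 a', '20000 b';
close $mk;

# Build the {zip => offset} index with tell, as in the original code.
open FILE, '<', 'unsorted.dat' or die "open: $!";
my %zips;
for (my $off = tell FILE; defined(my $line = <FILE>); $off = tell FILE) {
    my ($zip) = split ' ', $line;
    $zips{$zip} = $off;
}

open NEW, '>', 'resorted.dat' or die "open: $!";
for my $zip (sort { $a <=> $b } keys %zips) {
    seek FILE, $zips{$zip}, 0 or die "seek: $!";
    my $record = <FILE>;   # scalar context reads exactly one line...
    print NEW $record;     # ...so only one record is copied
}
close NEW;
close FILE;
```

Assigning <FILE> to a scalar (or calling it inside a scalar expression) forces scalar context; `print NEW <FILE>;` puts the readline in list context, which is why it slurped everything after the offset.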