Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of large records. Each record spans multiple lines:
id:999999
Name:John Doe
StreetAddr1:44 Someplace Dr
City:Someplace
State:CA
Zip:99999
MxId:33
Routing:b7
----------
id:789654
Name:Someone Else
StreetAddr1:33 Someotherplace Dr
etc...
Each record has more fieldname/fieldvalue pairs; I removed them for the purpose of this example.

My task is to keep the file format the same but sort the records by Zip. I must admit I'm at a loss as to how to begin. Since each record is rather large, and the entire file could also be rather large, I'm thinking I don't want to slurp the whole file into an array or a hash and then somehow sort that. Any pointers (I'm not asking you to do the work for me, and I'm not asking for code), be they links to modules or a simple explanation of how to begin, would be most helpful.

Replies are listed 'Best First'.
Re: Sorting A File Of Large Records
by dws (Chancellor) on Dec 10, 2002 at 19:20 UTC
    I'm thinking I don't want to slurp the whole file into an array or a hash and then somehow sort that.

    If you don't want to pull the whole file into memory, you have several alternatives. Here are two:

    1. Pull pieces of the file into memory, sort them, and write them to temporary files. Then merge the results into a final, sorted file.

    2. Scan the file once, remembering the seek offset of the beginning of each record, and the key you want to sort on. Sort the {key, offset} pairs, and then use this sorted list to seek/read records, emitting them in sorted order into a new file.

    If you have enough memory to deal with the {key, offset} pairs, I'd go that way. It's easier to code. The descriptions of tell() and seek() in perlfunc tell you what you need.
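
    Here's a rough, untested sketch of the index-building pass for option 2 (the file name and the ten-dash separator line are assumptions based on your sample data):

    # Sketch: build a {zip, offset} index in one pass using tell().
    local $/ = "----------\n";          # read one record at a time

    open FILE, '<', 'records.txt' or die "Can't open records.txt: $!";

    my @index;
    my $offset = tell(FILE);            # position *before* reading each record
    while (my $record = <FILE>) {
        my ($zip) = $record =~ /Zip:(\d{5})/;
        push @index, [ $zip, $offset ];
        $offset = tell(FILE);
    }

    # Each sorted entry's offset is then a seek() target for emitting
    # the records into the new file in zip order.
    my @sorted = sort { $a->[0] <=> $b->[0] } @index;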

      Thanks for idea #2. So I've got my {key, offset} pairs nicely sorted - no problem there. I can also seek to each offset in the unsorted file without a problem. My problem is this: once I'm in the correct starting position for my next record in the unsorted file (i.e. after calling seek), how can I extract *just* the next record and not the rest of the file from that point on:
      for my $zip (sort { $a <=> $b } (keys(%zips))) {
          seek FILE, $zips{$zip}, 0;
          print NEW <FILE>;
      }
      Obviously, <FILE> contains the rest of the file following the offset (i.e. $zips{$zip}) and not just the next record. Any ideas as to what I'm doing wrong?
        My problem is this: once I'm in the correct starting position for my next record in the unsorted file (i.e. after calling seek), how can I extract *just* the next record and not the rest of the file from that point on?

        Read the file line-by-line (i.e., using <FILE> in scalar context) until you've read the complete record.
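
        For example - a minimal, untested sketch using the handles from your code, with $/ left at its newline default and assuming the separator is a line of exactly ten dashes:

        seek FILE, $zips{$zip}, 0;
        while (my $line = <FILE>) {
            print NEW $line;
            last if $line =~ /^-{10}$/;   # stop once the separator line is copied
        }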

Re: Sorting A File Of Large Records
by dingus (Friar) on Dec 10, 2002 at 20:01 UTC
    This could be a good place to use An APL trick for the Schwartzian Transform, as that is almost exactly the reason I thought it up in the first place...

    There are some subtleties, mostly covered in the replies above, that help. The first trick (suggested by dmitri) is to set $/ to '----------'.$/ so that you read one record at a time.

    If the file is far too big to fit into memory, the second trick is to sort based on each record's location within the file and its zip code - use tell() on each record.

    Other tricks may depend on whether the file is local or not (whether you can afford to read it multiple times), whether you want to sort on a secondary key as well, and so on.

    Actually, thinking about it: assuming that you have sufficient memory and no secondary key you wish to sort on, my preferred solution would be a two-phase sort. Phase one is an insertion sort into a hash of zip codes. Then you read the file again and write out the sorted version.

    local $/ = '----------'.$/;
    my %zips;
    open (FILE, "<filename") or die "error $!";
    my $teller = tell(FILE);    # NB: need the position before the record is read!
    while (<FILE>) {
        die "no zip code found in record at $teller\n" unless (/Zip:(\d{5})/s);
        push @{$zips{$1}}, $teller;
        $teller = tell(FILE);   # FILE is optional here but a good idea!
    }
    open (SORTED, ">newfile") or die "Can't open newfile: $!";
    for (sort keys %zips) {
        for (@{$zips{$_}}) {
            seek FILE, $_, 0;
            print SORTED <FILE>;
        }
    }
    close FILE;
    close SORTED;
    Disclaimer - code untested may contain horrible bugs

    Dingus


    Enter any 47-digit prime number to continue.
      Nicely done dingus. I'd offer one tweak. I assume that the OP wanted to print just the current record to the "sorted" file and not the whole file from the seek offset:
      for my $zipCode (sort { $a <=> $b } (keys(%zips))) {
          for my $offset (@{$zips{$zipCode}}) {
              seek FILE, $offset, 0;
              while (<FILE>) {
                  last unless (/Zip:$zipCode/);
                  print SORTED $_;
              }
          }
      }
      -- vek --
Re: Sorting A File Of Large Records
by pfaut (Priest) on Dec 10, 2002 at 19:21 UTC
    If the file is too large to sort in memory, maybe you can use perl to convert it to a format that would lend itself to a normal file sort. Write a program to convert the multiline format to a single line format and another program to convert it back. You might even consider writing these programs as filters so the data would just pass through instead of having to create temporary intermediate files.
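
    For instance - an untested sketch of both filters, assuming a ten-dash separator line and that the control character \x01 never occurs in the data:

    # flatten.pl - one record per line, zip code in front as the sort key
    local $/ = "----------\n";          # one record per read
    while (my $rec = <STDIN>) {
        chomp $rec;                     # strip the trailing separator line
        my ($zip) = $rec =~ /Zip:(\d{5})/;
        $rec =~ s/\n/\x01/g;            # \x01 stands in for the newlines
        print "$zip\x01$rec\n";
    }

    # unflatten.pl - reverse the transformation after sorting
    while (my $line = <STDIN>) {
        chomp $line;
        $line =~ s/^\d{5}\x01//;        # drop the leading sort key
        $line =~ s/\x01/\n/g;           # restore the original line breaks
        print $line, "----------\n";    # $line already ends with a newline
    }

    The pipeline would then be: perl flatten.pl < records.txt | sort | perl unflatten.pl > sorted.txt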
      You could also set $/ to "--------\n" (or something like that), if the separating lines are the same.
Re: Sorting A File Of Large Records
by CountZero (Bishop) on Dec 10, 2002 at 19:45 UTC

    If you have a database handy somewhere, parse the file for the fieldnames and values, dump them to the database and extract them again with SQL (or similar) ordered in any way you like.

    When management sees what a nice job you have done and asks you next to sort it according to another field, all you have to do is write a new SQL query and the job is done.
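
    For instance - a rough, untested sketch using DBI with DBD::SQLite (the file, database, and table names are all invented for the example):

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                           { RaiseError => 1 });
    $dbh->do('CREATE TABLE recs (zip TEXT, body TEXT)');

    local $/ = "----------\n";          # assumes a ten-dash separator line
    open my $in, '<', 'records.txt' or die "Can't open records.txt: $!";
    my $ins = $dbh->prepare('INSERT INTO recs (zip, body) VALUES (?, ?)');
    while (my $rec = <$in>) {
        my ($zip) = $rec =~ /Zip:(\d{5})/;
        $ins->execute($zip, $rec);      # store the whole record verbatim
    }
    close $in;

    # Re-sorting on another field later is just a different ORDER BY.
    open my $out, '>', 'sorted.txt' or die "Can't open sorted.txt: $!";
    my $rows = $dbh->selectcol_arrayref('SELECT body FROM recs ORDER BY zip');
    print $out @$rows;
    close $out;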

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Sorting A File Of Large Records
by grantm (Parson) on Dec 10, 2002 at 21:54 UTC

    The XML::Filter::Sort module allows sorting arbitrarily large amounts of data in XML format. You could use it like this:

    1. convert your data to XML (eg: <rec id="999999" Name="John Doe" ... /> ...)
    2. parse the XML file with a SAX parser that passes events to ...
    3. XML::Filter::Sort which sorts the events and passes them to ...
    4. XML::SAX::Writer which outputs an XML file
    5. convert the XML back to your original format

    However, since XML::Filter::Sort is a SAX filter (rather than an XML filter), it's not actually necessary to use XML data at any stage of the game. Instead you could do it like this:

    1. write a simple SAX generator class (inherit from XML::SAX::Base) which reads the text file and generates SAX events to ...
    2. XML::Filter::Sort which sorts the events and passes them to ...
    3. a simple SAX handler class which accepts events and writes out plain text in the original format

    Although you didn't list maintainability as a key requirement, this type of arrangement would separate your code into logical units which each implement a well defined subset of the whole operation.
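
    For the all-XML arrangement, the middle of the pipeline (steps 2 to 4) might be wired up like this - a minimal, untested sketch with invented file names, assuming the records have been converted to <rec> elements with a <Zip> child element (I've used a child element rather than an attribute to keep the sort key spec simple):

    use XML::SAX::ParserFactory;
    use XML::Filter::Sort;
    use XML::SAX::Writer;

    # Events flow: parser -> sorter -> writer.
    my $writer = XML::SAX::Writer->new(Output => 'sorted.xml');
    my $sorter = XML::Filter::Sort->new(
        Record  => 'rec',               # the element that delimits one record
        Keys    => 'Zip',               # sort on the content of the <Zip> child
        Handler => $writer,
    );
    my $parser = XML::SAX::ParserFactory->parser(Handler => $sorter);
    $parser->parse_uri('records.xml');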

Re: Sorting A File Of Large Records
by metlhed_ (Beadle) on Dec 10, 2002 at 19:30 UTC

    Since the file is large you could try something like this:

    1. go through the whole file getting only the zip codes
    2. sort the zip codes in the order you need
    3. loop through the zips; for each zip, go through the file and output the matching records.

    This is very inefficient but, if your file is too large to fit into memory and time isn't a big deal, it should work.

      Edit: I just realised I described a very poor method, similar to dws's 2nd suggestion. I would go with dws's 2nd suggestion.