in reply to Merge Purge

You probably should be using a database, but if you are dead set against that for some reason...
My first thought is: how have you determined that this record matches another? By looking at other records? If so, why break the "clumping" into a separate pass?

Tailoring a better solution will depend on what your data is: what changes, what doesn't, and so on. Should all output records contain all the fields that every other matching record contains?

If you're on *nix or are using Cygwin, maybe you should make the match key the first field in the file and then pipe it into GNU sort (good at handling large files and pretty fast). Then all your reads should be sequential and you'll only have to hold the current match group's data in memory.
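Something like this is what I have in mind; the file name and the whitespace-delimited layout are just placeholders, not anything from your post:

    use strict;
    use warnings;

    my $file = 'records.txt';    # hypothetical input file

    # Let GNU sort do the heavy lifting on disk; -k1,1 assumes the match
    # key has been moved to the first whitespace-separated field.
    open my $sorted, '-|', 'sort', '-k1,1', $file
        or die "Cannot run sort on '$file': $!";

    while (my $line = <$sorted>) {
        # Records for a given key now arrive consecutively, so only the
        # current key's group ever has to sit in memory.
    }
    close $sorted;

Hope this helps.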

-Lee

"To be civilized is to deny one's nature."

Re: Re: Merge Purge
by krazken (Scribe) on Mar 22, 2002 at 15:09 UTC
    I would use a database for this, but I already have the data in a flat file, and the program that assigns that matchkey runs on a flat file as well, so instead of wasting time loading millions of records into a database, I just work on the flat file. Plus, the file is already sorted on the matchkey coming out of the previous program. I probably need to take advantage of the fact that the file is sorted: read until my matchkey changes, process that match group, and then read the next one. But there are times when I /try/ to write flexible code so that it wouldn't matter whether the file was sorted or not; I would like it to work either way. Make sense?
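    A minimal sketch of that read-until-the-matchkey-changes loop might look like the following; the file name, the pipe delimiter, and the process_group stub are invented for illustration, not taken from the real program:

        use strict;
        use warnings;

        my $file = 'sorted.dat';    # hypothetical; already sorted on the matchkey
        open my $fh, '<', $file or die "Cannot open '$file': $!";

        my ($current_key, @group);
        while (my $line = <$fh>) {
            chomp $line;
            my ($key) = split /\|/, $line, 2;    # assumes pipe-delimited, key first

            if (defined $current_key && $key ne $current_key) {
                process_group($current_key, \@group);    # handle the finished group
                @group = ();
            }
            $current_key = $key;
            push @group, $line;
        }
        process_group($current_key, \@group) if @group;  # don't lose the last group
        close $fh;

        sub process_group {
            my ($key, $records) = @_;
            # The real merge/purge rules would go here; this stub just reports size.
            print "$key => ", scalar(@$records), " record(s)\n";
        }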
      I think with your DB_File approach, the biggest problem is the one read/write for every record. I had a similar problem with a search index for 5,000,000 books; the thing took around 18 hours to finish. Taking advantage of sorting and working with the current record cut it down to 17 minutes.

      One thing I thought of (I don't know if someone else has done it; I couldn't find it at the time) was to subclass the DB_File tie and make a hash that wouldn't read and write on every access. It would have an intermediate cache. If you implemented caching behavior like this, it would probably speed things up an order of magnitude when the data is fairly sorted, still work about the same for the general case, and stay nice and generic.
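      Here is a rough sketch of that caching idea, written as a wrapper tie class rather than a true DB_File subclass; the package name, the cache size, and the write-back policy are all assumptions, and only FETCH/STORE/EXISTS are shown (no iteration support):

          package Tie::DBCache;    # hypothetical name

          use strict;
          use warnings;
          use DB_File;
          use Fcntl qw(O_CREAT O_RDWR);

          sub TIEHASH {
              my ($class, $file, $max) = @_;
              my %db;
              tie %db, 'DB_File', $file, O_CREAT | O_RDWR, 0666, $DB_HASH
                  or die "Cannot tie '$file': $!";
              return bless {
                  db    => \%db,
                  cache => {},       # recently seen key/value pairs
                  dirty => {},       # keys changed since the last flush
                  max   => $max || 10_000,
              }, $class;
          }

          sub FETCH {
              my ($self, $key) = @_;
              return $self->{cache}{$key} if exists $self->{cache}{$key};
              my $val = $self->{db}{$key};           # one real disk read
              $self->_remember($key, $val);
              return $val;
          }

          sub STORE {
              my ($self, $key, $val) = @_;
              $self->_remember($key, $val);
              $self->{dirty}{$key} = 1;              # defer the disk write
          }

          sub EXISTS {
              my ($self, $key) = @_;
              return exists $self->{cache}{$key} || exists $self->{db}{$key};
          }

          sub _remember {
              my ($self, $key, $val) = @_;
              $self->_flush if keys %{ $self->{cache} } >= $self->{max};
              $self->{cache}{$key} = $val;
          }

          sub _flush {
              my $self = shift;
              $self->{db}{$_} = $self->{cache}{$_} for keys %{ $self->{dirty} };
              %{ $self->{cache} } = ();
              %{ $self->{dirty} } = ();
          }

          sub DESTROY { $_[0]->_flush }

          1;

      You would use it like 'tie my %records, 'Tie::DBCache', 'records.db', 50_000;' and reads and writes that hit the cache would never touch the disk until the cache fills or the hash goes away.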

      -Lee

      "To be civilized is to deny one's nature."