in reply to How to eliminate redundancy in huge dataset (1,000 - 10,000)

I think you just loop through the dataset and keep a hash of the IDs. For each record, check whether its ID is already in the hash: if it is, skip to the next record; if it isn't, add the ID to the hash and write the record to the output file.
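
Something like this, as a minimal sketch. I'm assuming one record per line with the ID as the first whitespace-separated field, and the filenames are just placeholders; adjust both to match your actual data.

    use strict;
    use warnings;

    my %seen;    # IDs we've already written
    open my $in,  '<', 'records.txt' or die "Can't read records.txt: $!";
    open my $out, '>', 'unique.txt'  or die "Can't write unique.txt: $!";

    while ( my $line = <$in> ) {
        my ($id) = split ' ', $line;    # grab the ID field (assumed first)
        next if $seen{$id}++;           # seen before? skip this record
        print $out $line;               # first occurrence: keep it
    }

    close $in;
    close $out;

The $seen{$id}++ idiom does the lookup and the insert in one step: it evaluates to false the first time an ID appears and true every time after.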

10,000 records isn't really that big. I used to bullseye womp rats in my T-16 back home, and they're not much smaller than 10,000 records. Er... I mean, I process batches of 10,000 records all the time.

--Pileofrogs
