in reply to Re: Find duplicate lines from the file and write it into new file.
in thread Find duplicate lines from the file and write it into new file.

Hashing might help, but you have to consider the risk of collisions. If the records are all the same size, then seeking for comparisons can be done in linear time, but the hit for actually doing the comparisons will be a function of how many collisions there are -- and the smaller your hash space, the more of those there will be.
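
For what it's worth, here is a minimal sketch of the digest idea in Perl. The filenames (input.txt, dups.txt) are placeholders, and since only the 16-byte MD5 digests are kept, a genuine collision would misreport two distinct lines as duplicates:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Keep only the raw 16-byte digest of each line; memory stays small,
    # but two different lines that happen to share a digest would be
    # reported as duplicates.
    my %seen;
    open my $in,  '<', 'input.txt' or die "input.txt: $!";
    open my $out, '>', 'dups.txt'  or die "dups.txt: $!";
    while ( my $line = <$in> ) {
        print $out $line if $seen{ md5($line) }++;
    }
    close $in;
    close $out or die "dups.txt: $!";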

If you don't know anything about a typical record, I think the mergesort approach is a good bet.
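
Here's a sketch of that route, leaning on the system sort(1) to do the external merge sort so the file never has to fit in memory (again, the filenames are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # sort(1) spills to temp files, so it handles inputs far larger than RAM.
    # After sorting, duplicate lines are adjacent and easy to spot.
    open my $sorted, '-|', 'sort', 'input.txt' or die "sort: $!";
    open my $out, '>', 'dups.txt' or die "dups.txt: $!";

    my ( $prev, $reported ) = ( undef, 0 );
    while ( my $line = <$sorted> ) {
        if ( defined $prev && $line eq $prev ) {
            # adjacent equal lines => duplicate; print it once
            print $out $line unless $reported++;
        }
        else {
            $reported = 0;
        }
        $prev = $line;
    }
    close $sorted;
    close $out;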


Replies are listed 'Best First'.
Re^3: Find duplicate lines from the file and write it into new file.
by jettero (Monsignor) on Jan 04, 2007 at 14:10 UTC

    Yeah, without knowing anything about the input file I was really just brainstorming... In my experience a 4 GB file is probably a log file or something, and collisions may not matter very much; it really depends on what the filtered output is meant to be used for.

    -Paul