in reply to Find duplicate lines from the file and write it into new file.

That could be a hard problem, depending on what you need the filtered data for. My usual method is a $hash{$line}++ approach to find dupes, but that's going to eat a lot of RAM (probably the problem you're having now) unless the lines have some shorter identifier you can key on instead.
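For what it's worth, here's a minimal sketch of that $hash{$line}++ approach; the file names are just placeholders. Every unique line becomes a hash key, which is exactly where the memory goes:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;

    open my $in,  '<', 'input.txt'      or die "Can't read input.txt: $!";
    open my $out, '>', 'duplicates.txt' or die "Can't write duplicates.txt: $!";

    while ( my $line = <$in> ) {
        # The whole line is the hash key, so memory grows with the
        # number of unique lines -- painful on a multi-gig file.
        print $out $line if ++$seen{$line} == 2;   # emit once, on the second sighting
    }

    close $out or die "close: $!";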

One option might be to build a Digest::MD5 digest of each line and count the digests instead. That could be really computationally expensive, especially for a file that big. I guess it depends on what the lines look like. Would it be possible to include a few sample lines?
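Something along these lines, maybe -- a sketch using Digest::MD5's functional md5() interface rather than the OO one, again with placeholder file names. Each unique line now costs a 16-byte key instead of the whole line, at the price of an MD5 per line and a (tiny) chance that a collision flags two different lines as dupes:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my %seen;

    open my $in,  '<', 'input.txt'      or die "Can't read input.txt: $!";
    open my $out, '>', 'duplicates.txt' or die "Can't write duplicates.txt: $!";

    while ( my $line = <$in> ) {
        my $key = md5($line);                     # 16 raw bytes per unique line
        print $out $line if ++$seen{$key} == 2;   # emit once, on the second sighting
    }

    close $out or die "close: $!";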

-Paul

Re^2: Find duplicate lines from the file and write it into new file.
by gaal (Parson) on Jan 04, 2007 at 14:07 UTC
    Hashing might help, but you have to consider the risk of collisions. If records are all the same size, then seeking for comparisons can be done in linear time, but the hit for actually doing the comparisons will be a function of how many collisions there are -- and the smaller your hashspace, the more of those there will be.
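    To make that concrete, here's a rough sketch assuming fixed-length records of $RECLEN bytes (newline included) and placeholder file names: remember the record number of each digest's first sighting, and on a digest match seek back through a second handle and compare the real bytes, so a collision can't produce a false duplicate.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Digest::MD5 qw(md5);

        my $RECLEN = 80;    # assumed fixed record length, newline included

        open my $in,     '<', 'input.txt'      or die "read input.txt: $!";
        open my $lookup, '<', 'input.txt'      or die "reopen input.txt: $!";
        open my $out,    '>', 'duplicates.txt' or die "write duplicates.txt: $!";

        my %first;          # digest => record number of its first sighting
        my $recno = 0;

        while ( my $rec = <$in> ) {
            my $key = md5($rec);
            if ( exists $first{$key} ) {
                # Possible dupe: fetch the original record and compare bytes.
                seek $lookup, $first{$key} * $RECLEN, 0 or die "seek: $!";
                read $lookup, my $orig, $RECLEN;
                print $out $rec if $orig eq $rec;
            }
            else {
                $first{$key} = $recno;
            }
            $recno++;
        }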

    If you don't know anything about a typical record, I think the mergesort approach is a good shot.
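    Roughly like this, if you let an external merge sort do the heavy lifting -- here GNU sort via a pipe on a Unix-ish box, with placeholder file names (and note the original line order is lost). Duplicates come out adjacent, so one pass over the sorted stream finds them:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # sort(1) spills to temp files, so it copes with files bigger than RAM.
        open my $sorted, '-|', 'sort', 'input.txt' or die "sort: $!";
        open my $out,    '>',  'duplicates.txt'    or die "write duplicates.txt: $!";

        my ( $prev, $count ) = ( '', 0 );
        while ( my $line = <$sorted> ) {
            if ( $line eq $prev ) {
                print $out $line if ++$count == 2;   # emit once per duplicated line
            }
            else {
                ( $prev, $count ) = ( $line, 1 );
            }
        }

    (From the shell, "sort input.txt | uniq -d > duplicates.txt" gets you much the same thing.)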

      Yeah, without knowing anything about the input file I was really just brainstorming... In my experience a 4-gig file is probably a log file or something, and collisions may not matter very much -- it really depends on what the filtered output is intended to do.

      -Paul