in reply to Removing duplicates in large files
I assume your code looks something like the following?
#! /usr/bin/perl -w
use strict;

my %seen;
while ( <> ) {
    chomp;
    ++$seen{$_};
}
print "$_\n" for keys %seen;
If that's the case, the script will run as fast as memory allows. If you begin to swap, though, it might not be very fast at all. You could then use each instead of for, which avoids building a huge list of all the keys in memory. Change the last line to:
print "$_\n" while ($_ = each %seen);
If you haven't begun to swap by the time you've loaded all the lines, you'll be fine. If you have, you'll either have to buy more RAM or use a divide-and-conquer approach.
The following should get you started:
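(The code that originally went here isn't shown; below is a minimal sketch of one divide-and-conquer scheme, assuming you split the input into bucket files by a checksum of each line so that identical lines always land in the same bucket and each bucket's hash fits in memory. The bucket count of 16 and the bucket.N file names are placeholders.)

#! /usr/bin/perl -w
use strict;

# Divide: distribute lines into bucket files so identical lines always
# end up in the same bucket (here, via a cheap 32-bit checksum of the line).
my $buckets = 16;    # placeholder; raise it until each bucket fits in RAM
my @fh;
for my $i ( 0 .. $buckets - 1 ) {
    open $fh[$i], '>', "bucket.$i" or die "bucket.$i: $!";
}
while ( <> ) {
    chomp;
    my $b = unpack( "%32C*", $_ ) % $buckets;
    print { $fh[$b] } "$_\n";
}
close $_ for @fh;

# Conquer: dedup each bucket with a hash that is now small enough for memory.
for my $i ( 0 .. $buckets - 1 ) {
    my %seen;
    open my $in, '<', "bucket.$i" or die "bucket.$i: $!";
    while ( <$in> ) {
        chomp;
        print "$_\n" unless $seen{$_}++;
    }
    close $in;
}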
But any of the other techniques posted here would do just as well. I would personally do:
sort -u -o uniq.txt <dups.txt
... and pick up the results in uniq.txt. There are ways of doing this in Windows, you know.
Oh, and one thing: what do you mean about timeouts?
Re: Re: Removing duplicates in large files (a hash, or divide-and-conquer)
by sfink (Deacon) on Jan 30, 2004 at 22:36 UTC