in reply to Removing duplicates in large files

I assume your code looks something like the following?

#! /usr/bin/perl -w
use strict;

my %seen;
while (<>) {
    chomp;
    ++$seen{$_};
}
print "$_\n" for keys %seen;

If that's the case, the script will run as fast as memory allows. If you begin to swap, that might not be very fast at all. In that case you could iterate with each instead of for over keys, which avoids building a huge list containing all the keys at once. Just change the last line to:

print "$_\n" while ($_ = each %seen);

If you haven't begun to swap by the time you've loaded all the lines, you'll be fine. If you have, you'll either have to buy more RAM or use a divide-and-conquer approach.

The following should get you started (a rough code sketch follows the list):

  1. Take the email address and match it against /^([^@]*)@(.*)$/. (This is quick and dirty, but probably adequate for the task at hand).
  2. If it doesn't match print $_ into a file named 'dunno'.
  3. If it does, open $2 as a filename for output, print $1 into it and then close it.
  4. Process all the lines this way. Warning: this will be extraordinarily slow, since it opens and closes a file for every line.
  5. At the end, for each file you have written, use the hash technique above to weed out the duplicates.
  6. Regenerate the original address from the current line and name of current file.
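
Here is a rough sketch of the splitting pass (steps 1 through 4). The 'buckets' directory name is my own choice, and the code simply trusts the domain part to be a sane filename, in keeping with the quick-and-dirty regex above:

#! /usr/bin/perl -w
# Splitting pass: bucket each address into a per-domain file, so that
# each bucket is small enough to dedupe in memory later (step 5).
use strict;

my $dir = 'buckets';                  # assumed output directory
mkdir $dir unless -d $dir;

open my $dunno, '>', 'dunno' or die "dunno: $!";

while (<>) {
    chomp;
    if (/^([^@]*)@(.*)$/) {
        my ($local, $domain) = ($1, $2);
        # Opening and closing a file on every line is exactly what
        # makes step 4 so slow.
        open my $out, '>>', "$dir/$domain" or die "$dir/$domain: $!";
        print $out "$local\n";
        close $out;
    }
    else {
        print $dunno "$_\n";          # step 2: lines that don't look like addresses
    }
}
close $dunno;

Step 5 is then the %seen loop from the top of this node run over each bucket file, and step 6 just re-joins each surviving local part and the bucket's filename with an "@".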

But any of the other techniques posted here would do just as well. I would personally do

sort -u -o uniq.txt <dups.txt

... and pick up the results in uniq.txt. There are ways of doing this in Windows, you know.

Oh, and one more thing: what do you mean about timeouts?

Re: Re: Removing duplicates in large files (a hash, or divide-and-conquer)
by sfink (Deacon) on Jan 30, 2004 at 22:36 UTC
    If you're doing that, you may as well do it in one pass:
    while (<>) { print unless $seen{$_}++; }
    You could also shrink the memory usage by computing your own hash value and using that as the %seen key -- but I don't think I'm going to get into any more details unless the original poster swears that this has nothing to do with harvesting addresses for spammers.
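    In outline, that idea would look something like the following sketch, with Digest::MD5 standing in for "your own hash value" (that module choice is mine; any compact digest would do):

    #! /usr/bin/perl
    # One-pass dedup keyed on a fixed-size digest of each line instead of
    # the line itself, so %seen stays small even when the lines are long.
    use strict;
    use Digest::MD5 qw(md5);   # md5() returns a 16-byte binary digest

    my %seen;
    while (<>) {
        print unless $seen{ md5($_) }++;
    }
    # A digest collision would silently drop a line, but with a 128-bit
    # digest that risk is negligible for a job like this.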