I assume your code looks something like the following?

#! /usr/bin/perl -w
use strict;
my %seen;
while( <> ) {
    chomp;
    ++$seen{$_};
}
print "$_\n" for keys %seen;

If that's the case, the script will run as fast as memory allows. If you begin to swap, that might not be very fast at all. You could then use each instead of for, which avoids building a huge list of all the keys at once. To do so, change the last line to:

print "$_\n" while defined($_ = each %seen);

If you haven't begun to swap by the time you've loaded all the lines, you'll be fine. If you have, you'll either have to buy more RAM or use a divide-and-conquer approach.

The following should get you started:

  1. Take the email address and match it against /^([^@]*)@(.*)$/. (This is quick and dirty, but probably adequate for the task at hand.)
  2. If it doesn't match, print $_ into a file named 'dunno'.
  3. If it does, open $2 as a filename for appending (not overwriting), print $1 into it and then close it.
  4. Process all the lines this way. Warning: this will be extraordinarily slow, since a file is opened and closed for every line.
  5. At the end, for each file you have written, use the hash technique above to weed out the duplicates.
  6. Regenerate each original address from the current line and the name of the current file.
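
The steps above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions of mine: the sub names bucket_by_domain and dedupe_buckets, the '.bucket' file suffix and the work directory are all made up for illustration. I have also tightened the domain part of the regex so that only characters safe in a file name are accepted; anything else lands in 'dunno'. Instead of collecting the keys and printing them afterwards, it prints each line the first time the hash sees it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Steps 1-4: append each local part to a bucket file named after its domain.
# Lines that do not match, or whose domain is not a safe file name
# (an assumption added here), are written whole to 'dunno'.
sub bucket_by_domain {
    my ($in_fh, $dir) = @_;
    while (my $line = <$in_fh>) {
        chomp $line;
        my ($file, $text) = ('dunno', $line);
        if ($line =~ /^([^@]*)@([A-Za-z0-9_.-]+)$/) {
            ($file, $text) = ("$2.bucket", $1);
        }
        # Open, append, close for every line: slow, but only one
        # file handle is ever open at a time.
        open my $out, '>>', "$dir/$file"
            or die "Cannot append to $dir/$file: $!";
        print {$out} "$text\n";
        close $out or die "Cannot close $dir/$file: $!";
    }
}

# Steps 5-6: dedupe one bucket at a time with a hash, rebuilding
# "local@domain" from the line and the bucket's file name.
# Lines from 'dunno' are printed unchanged.
sub dedupe_buckets {
    my ($dir, $out_fh) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @files = sort grep { $_ eq 'dunno' || /\.bucket\z/ } readdir $dh;
    closedir $dh;
    for my $file (@files) {
        (my $domain = $file) =~ s/\.bucket\z//;
        my $suffix = $file eq 'dunno' ? '' : "\@$domain";
        my %seen;    # holds the keys of a single bucket only
        open my $in, '<', "$dir/$file" or die "Cannot read $dir/$file: $!";
        while (my $line = <$in>) {
            chomp $line;
            print {$out_fh} "$line$suffix\n" unless $seen{$line}++;
        }
        close $in;
    }
}

# Hypothetical usage: perl dedupe.pl dups.txt workdir > uniq.txt
if (@ARGV == 2) {
    my ($input, $dir) = @ARGV;
    -d $dir or mkdir $dir or die "Cannot create $dir: $!";
    open my $in, '<', $input or die "Cannot read $input: $!";
    bucket_by_domain($in, $dir);
    close $in;
    dedupe_buckets($dir, \*STDOUT);
}
```

The work directory should start out empty, since the buckets are opened for appending and leftovers from an earlier run would be mixed in. The output comes out grouped by bucket rather than in the original order. On a case-insensitive filesystem, domains that differ only in case will share one bucket.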

But any of the other techniques posted here would do just as well. I would personally do

sort -u -o uniq.txt <dups.txt

... and pick up the results in uniq.txt. There are ways of doing this in Windows, you know.

Oh, and one thing, what do you mean about timeouts?


In reply to Re: Removing duplicates in large files (a hash, or divide-and-conquer) by grinder
in thread Removing duplicates in large files by TIURIC
