Update (archaeological note): the OP's previous questions are "Removing duplicates in large files" and "End of the Time Out Tireny".
If, as you've said before, your code is crawling or running out of memory while sorting only 120,000 email addresses, the problem is in your code. Please show what you've tried instead of just asking again; you will get better answers. (Update: my hypothesis that memory shouldn't be an issue is predicated on the knowledge that this is running on a web server.)
Also, some evidence that you've tried some of the solutions already proposed would be appreciated. Why not give the solutions in Re: Removing duplicates in large files (a hash, or divide-and-conquer) a try, starting with the first and working down?
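For instance, a minimal sketch of the hash approach, assuming one address per line in a file (emails.txt is a hypothetical name):

    # Print each address the first time it is seen. Memory use is
    # proportional to the number of *unique* addresses, which should
    # be no problem at all for 120,000 of them.
    my %seen;
    open my $in, '<', 'emails.txt' or die "Cannot open emails.txt: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        print "$line\n" unless $seen{$line}++;
    }
    close $in;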
Does this node refer back to a previous one? If so, please provide an Update pointing to it.
Normally folks post their code when asking how to do something better, since without it we can't tell how you did it. If the code is longish, use the readmore tags or put it on your scratchpad.
-Theo-
(so many nodes and so little time ... )
I think I've made the scratchpad suggestion before, with the understanding that if it were actually taken up, I would be obliged to extract the useful part of what was there and put it back into the question thread.
Didn't you read the replies to your last thread? Create a hash, and add a key for each item in your list. Then use keys %hash to get your list of unique items. If you need to sort it anyway, you can also remove duplicate items very easily from the sorted list with:

    my ( @res, $last );
    foreach my $item (@sorted) {
        # skip any item equal to the one before it
        next if defined $last and $item eq $last;
        push @res, $item;
        $last = $item;
    }
This has the side effect of preserving the order of the list.
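By contrast, keys %hash returns items in no particular order, so the hash route needs an explicit sort if order matters. A minimal sketch (@list and @unique are illustrative names):

    my %seen;
    $seen{$_} = 1 for @list;        # record each item once
    my @unique = sort keys %seen;   # duplicate-free, sorted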
The method I have devised to search for duplicates does indeed remove all of them, but it is slow and probably not the best way to do it.
No one (including you) will ever know if there is a better way to do it until you show us how you are doing it now. Post something that looks like code and indicates what you are doing.
Is there such a thing as a "sort list without duplicates" function usable on a Windows system? I'm sure this can be done in UNIX. Thanks again; you guys are most skilled in programming balance.
On any Unix system, you would use a command line like this:
sort -u file.txt > sorted-uniq-file.txt
There are at least a few good sources (Cygwin, GNU, AT&T Research Labs) where you can get comprehensive kits that port all the basic Unix command-line utilities -- not just sort, but also ls, find, cut, paste, grep, awk, tar ... and, most important, the bash shell -- for use on any MS-Windows system (including source code and a gcc compiler, if you're into that sort of thing).
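And since Perl itself runs fine on MS-Windows, here is a rough pure-Perl stand-in for sort -u (a sketch; the filenames are the same placeholders as above, and the whole file is read into memory, which is fine at this scale):

    perl -e "print sort grep { !$seen{$_}++ } <>" file.txt > sorted-uniq-file.txt

The grep keeps only the first occurrence of each line, and the sort puts the survivors in order.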