Re: Removing duplicates in large files
by stvn (Monsignor) on Jan 30, 2004 at 20:04 UTC
You are using the wrong tool to solve your problem.
- Install MySQL
- Create a table with an email column
- Give the email column a "unique" constraint
- Import the file (just tell MySQL it's a CSV file with one column)
- SELECT out the results into a newline-delimited file
If you don't want to install MySQL, most any other DB will do (even Access, probably). If the other DB doesn't allow uniqueness constraints on fields, then just SELECT DISTINCT instead.
This might seem like a lot of work, but if properly set up, you can use it over and over to process this file whenever you need to (a rough DBI sketch follows below). Or you could just start using the DB instead, since it will make your life a lot easier.
-stvn
Update: The more I re-read your post the more confused I am getting. Are these already in a database (RDBMS)? Or is it in a text-file database? If it's already in an RDBMS then just do a SELECT DISTINCT <column name> FROM <tablename> to get what you want; otherwise, see above. If you can't install the DB on the server, do it on your local machine and download the file. And for sure, don't do something like this with a CGI script; it will surely time out no matter what you do.
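For what it's worth, the database route might look roughly like the sketch below. This is only an illustration: it assumes DBI and DBD::mysql are installed, and the database, user, table, and file names (scratch, emails, emails.txt, unique.txt) are all placeholders.
#!/usr/bin/perl
# Rough sketch only; all names below are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=scratch', 'user', 'password',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS emails (email VARCHAR(255) NOT NULL, UNIQUE (email))');

# INSERT IGNORE quietly drops rows that violate the UNIQUE constraint
my $ins = $dbh->prepare('INSERT IGNORE INTO emails (email) VALUES (?)');

open my $in, '<', 'emails.txt' or die "Cannot open emails.txt: $!";
while (my $addr = <$in>) {
    chomp $addr;
    $ins->execute(lc $addr);
}
close $in;

# pull the de-duplicated list back out, one address per line
open my $out, '>', 'unique.txt' or die "Cannot open unique.txt: $!";
print {$out} "$_\n" for @{ $dbh->selectcol_arrayref('SELECT email FROM emails') };
close $out;

$dbh->disconnect;
INSERT IGNORE leans on the UNIQUE constraint to drop duplicates on the way in; if the constraint is left out, a SELECT DISTINCT email FROM emails at the end gets the same result.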
Thank you very much; that is pretty much what I thought.
It is a text-based file.
I am using CGI since this is an internet-based software package I am developing.
So I have no choice but to use SQL; I can't say I am that familiar with it.
But I think it is time I get familiar, because I have a lot of these types of files and CGI will never do the job.
Thanks again.
Re: Removing duplicates in large files
by lestrrat (Deacon) on Jan 30, 2004 at 20:29 UTC
I suppose that if you must use Perl for this, you could use DB_File (or other *DB_File modules), and just keep chugging the email addresses into a file. Since this is a hash, it would weed out duplicates.
some code fragments...
use strict;
use warnings;
use DB_File;

# file names here are just placeholders
tie my %hash, 'DB_File', 'emails.db'
    or die "Cannot tie emails.db: $!";

open my $fh, '<', 'emails.txt' or die "Cannot open emails.txt: $!";
while (my $addr = <$fh>) {
    chomp($addr);
    $hash{lc($addr)} = 1;   # duplicate addresses collapse onto a single key
}
close $fh;
Then you can open that db that DB_File created, and dump it to a file, whatever.
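A second pass for that dump might be as small as the following sketch (again, the file names emails.db and unique.txt are just placeholders):
use strict;
use warnings;
use DB_File;

tie my %hash, 'DB_File', 'emails.db'
    or die "Cannot tie emails.db: $!";

# one unique address per line
open my $out, '>', 'unique.txt' or die "Cannot open unique.txt: $!";
print {$out} "$_\n" for keys %hash;
close $out;
untie %hash;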
However, if you got that much data I would use SQL ;)
Re: Removing duplicates in large files
by crabbdean (Pilgrim) on Jan 30, 2004 at 20:31 UTC
Well, really, I don't think 120,000 is all that big. I recently wrote a script to parse out duplicates on our backup tapes. I clocked the number of cycles and it pushed nearly 9 million in about 30 secs (from memory). And I'm not on a ripper of a machine. Your 120,000 should rip through relatively quickly.
Putting speed aside, I'm more interested in how you are weeding out duplicates. Using hashes will be faster. Use the email address as the key to the hash and just test if the key already exists. If it does, then you know you have a duplicate and can output it to a separate file or just ignore it completely.
I'll assume you know how to read your emails into an array. Just do a foreach on the array, put each element into the hash, and output any that already exist. For example, see below.
Enjoy!
Dean
my %hash;
foreach (@array) {
    if (!exists $hash{$_}) {
        $hash{$_} = $_;
    }
    else {
        ## duplicate: output it here, or just ignore it
        print $_;
    }
}
# then print the keys of %hash to get your non-duplicate results
my %hash_of_unique_emails = map { $_ => 1 } @array_of_emails;
Or even better, as a one-liner from the shell:
perl -e 'print keys %{ { map { $_ => 1 } <> } }' < test.data
That should work on Windows too (worked fine on my OS X machine, but alas, no Windows box to test it on here).
-stvn
Update: initially forgot the map {} in the first example... sorry, been a long day.
Re: Removing duplicates in large files (a hash, or divide-and-conquer)
by grinder (Bishop) on Jan 30, 2004 at 20:34 UTC
#!/usr/bin/perl -w
use strict;

my %seen;
while (<>) {
    chomp;
    ++$seen{$_};
}
print "$_\n" for keys %seen;
If that's the case, the script will run as fast as memory allows. If you begin to swap, that might not be very fast at all. In that case, you could use each instead of for, which will avoid creating a huge list containing all the keys. To do that, change the last line to:
print "$_\n" while defined($_ = each %seen);
If you haven't begun to swap by the time you've loaded all the lines you'll be fine. On the other hand, if you are, you'll either have to buy more RAM, or use a divide and conquer approach.
The following should get you started (a rough sketch follows the list):
- Take the email address and match it against /^([^@]*)@(.*)$/. (This is quick and dirty, but probably adequate for the task at hand).
- If it doesn't match print $_ into a file named 'dunno'.
- If it does, open $2 as a filename for output, print $1 into it and then close it.
- Process all the lines this way. Warning, this will be extraordinarily slow.
- At the end, for each file you have written, use the hash technique above to weed out the duplicates.
- Regenerate the original address from the current line and the name of the current file.
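A rough sketch of those steps might look like this. It takes the domain itself as the bucket file name (so the domains are assumed to be safe as filenames) and keeps the open/print/close-per-line pattern described above, slowness and all:
use strict;
use warnings;

# First pass: split addresses into one file per domain.
while (my $line = <>) {
    chomp $line;
    if ($line =~ /^([^@]*)@(.*)$/) {
        my ($local, $domain) = ($1, $2);
        open my $fh, '>>', $domain or die "Cannot open $domain: $!";
        print {$fh} "$local\n";
        close $fh;
    }
    else {
        open my $fh, '>>', 'dunno' or die "Cannot open dunno: $!";
        print {$fh} "$line\n";
        close $fh;
    }
}

# Second pass, per bucket file: apply the %seen technique above to the
# local parts, then print "$local\@$domain\n" to rebuild the full address.
Each bucket then only holds the local parts for one domain, so even the largest bucket should fit comfortably in a %seen hash.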
But any of the other techniques posted here would do just as well. I would personally do
sort -u -o uniq.txt <dups.txt
... and pick up the results in uniq.txt. There are ways of doing this in Windows, you know.
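For what it's worth, the %seen script above is portable: saved as, say, dedup.pl (the name is just an example), it can be run the same way from a Windows command prompt:
perl dedup.pl dups.txt > uniq.txt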
Oh, and one thing, what do you mean about timeouts?
If you're doing that, you may as well do it in one pass:
my %seen;
while (<>) {
    print unless $seen{$_}++;
}
You could also shrink the memory usage by computing your own hash value and using that as the %seen key -- but I don't think I'm going to get into any more details unless the original poster swears that this has nothing to do with harvesting addresses for spammers.
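For reference only, the roll-your-own hash value idea might look something like this sketch, here using the core Digest::MD5 module so that each %seen key is a fixed 16-byte digest instead of a full line:
use strict;
use warnings;
use Digest::MD5 qw(md5);

my %seen;
while (<>) {
    # key on the binary digest of the line rather than the line itself
    print unless $seen{ md5($_) }++;
}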
Re: Removing duplicates in large files
by b10m (Vicar) on Jan 30, 2004 at 19:46 UTC
$ sort file_with_duplicates | uniq > file_without_duplicates
--
b10m
All code is usually tested, but rarely trusted.
How about using a Windows OS?
Right now I am using some extremely simple code utilising a foreach loop to scan through the e-mails and find dups; this is not efficient, because the server times out before it can even get halfway through all the e-mail addys.
Even if it didn't, it would still take 100 years to load the page doing it this way.
Is there a Windows equivalent to what you are proposing?
TIURIC
What server? You didn't say anything about a server in your initial description. What is the CGI process doing? A 120,000 line file is not that big. Perl should be able to tear through it in a couple of seconds.
The issue is probably somewhere else, either a slow algorithm or network issues. Since you didn't say where the script is running, how it is getting the data, or how it is communicating with the user, we can't help you.
Re: Removing duplicates in large files
by TIURIC (Initiate) on Jan 30, 2004 at 19:48 UTC
A 120,000-email file delimited by newlines.
Searching for duplicates on a windows operating system.
Keep getting timed out because it takes forever to go through each email addy.
This evil must be stopped.
TIURIC.