Re: How to remove duplicates from a large set of keys

Replies are listed 'Best First'.
Re^2: How to remove duplicates from a large set of keys by nite_man (Deacon) on Feb 10, 2005 at 08:46 UTC
Unfortunately, lookup in the hash with more than million keys takes a lot of time. I suspect that there is a way to do it more efficient. --- Michael Stepanov aka nite_man It's only my opinion and it doesn't have pretensions of absoluteness!	[reply]
Re^3: How to remove duplicates from a large set of keys by demerphq (Chancellor) on Feb 10, 2005 at 08:50 UTC
Lookup in a hash of 1 million keys is rougly the same as lookup in a hash of 10 keys. :-) (Assuming you are still inside of physical memory.) Its creating the hash thats the problem. It takes a long time, especially if you dont know how many records you are storing up front. --- demerphq	[reply]
Re^4: How to remove duplicates from a large set of keys by nite_man (Deacon) on Feb 10, 2005 at 09:04 UTC
Lookup in a hash of 1 million keys is rougly the same as lookup in a hash of 10 keys Can you explain, please --- Michael Stepanov aka nite_man It's only my opinion and it doesn't have pretensions of absoluteness!	[reply]
Re^5: How to remove duplicates from a large set of keys by demerphq (Chancellor) on Feb 10, 2005 at 09:20 UTC
Re^3: How to remove duplicates from a large set of keys by jbrugger (Parson) on Feb 10, 2005 at 08:59 UTC
Is it? This script took 6 seconds to complete (run), with 2 million keys, on a Celeron 2.4 Ghz, 512 Mb, running X, Mozilla, Apache, john, etc. etc. and took at max. 27 % of the memory `#!/usr/bin/perl -w use strict; my %lookupTable; for (my $i=0; $i<=2000000; $i++) { if (!$lookupTable{$i}) { $lookupTable{$i} = 1; } }` [download]	[reply] [d/l]
Re^4: How to remove duplicates from a large set of keys by FitTrend (Pilgrim) on Feb 10, 2005 at 13:54 UTC
I was curious so I took the same code, added a timer, and ran it on win32. It took 10 seconds. However, I'm running a 1.8 Pentium M, 1G of memory, WinXP, ActivePerl 5.8.6. I suspect that nite_man does not have as beefy of a box as you do. It might be acceptable if the look up doesn't occur frequently. In retrospect, it wasn't a good test since the system specs weren't the same and I'm not fimilar with the whole problem. But it was a good 30 second distraction. :) For the curious: `#!perl -w use strict; my $startTime = time(); my %lookupTable; for (my $i=0; $i<=2000000; $i++) { if (!$lookupTable{$i}) { $lookupTable{$i} = 1; } } my $totalTime = time() - $startTime; print "Total Time: $totalTime\n";` [download] -Marc	[reply] [d/l]