how to speed up dupe checking of arrays

ultibuzz has asked for the wisdom of the Perl Monks concerning the following question:

Re: how to speed up dupe checking of arrays by wind (Priest) on Jul 31, 2007 at 09:57 UTC
Most likely you're simply going to be throttled by IO speed. However, your code above could in theory be simplified by limiting the split to only 3 parts and by moving the dup check into the while loop, assuming the dup count is all you really care about. The first print gives the total number of lines (`$.`), the second the number of unique rows. `if ($file =~ $spec_text){ my $file_date = (split(/\./,$file))[3]; open(IN, '<', $file) or die("open failed: $!"); my $count_uniq = 0; my %seen; while (<IN>) { chomp; my ($ele0, $ele1, undef) = split ';', $_, 3; $count_uniq++ if !$seen{"$ele0;$ele1;$file_date"}++; } print "$.\n"; print "$count_uniq\n"; close(IN); }` - Miller
Re^2: how to speed up dupe checking of arrays by ultibuzz (Monk) on Jul 31, 2007 at 10:25 UTC
I need them in an array, so I adjusted your code like this: `my @rows; my %seen; while (<IN>) { chomp; my ($ele0, $ele1, undef) = split ';', $_, 3; push @rows,"$ele0;$ele1;$file_date" if !$seen{"$ele0;$ele1;$file_date"}++; } close(IN);` What should I say: awesome! From 203 seconds down to around 11 seconds. Great, so no overtime at the office needed ;) Thanks a lot. kd ultibuzz
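For anyone who wants to try the approach above without the original data files, here is a minimal self-contained sketch. The sample lines and the `$file_date` value are made up for illustration; only the loop body follows the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real ';'-separated file.
my $data = join "\n",
    'a;1;x',
    'b;2;y',
    'a;1;z',    # same first two fields as line 1, so a duplicate
    'c;3;x',
    'b;2;q',    # duplicate of line 2
    '';
my $file_date = '20070731';    # assumed value; the thread derives it from the file name

# Read from an in-memory filehandle so the sketch runs without a file on disk.
open(my $in, '<', \$data) or die "open failed: $!";

my (@rows, %seen);
while (my $line = <$in>) {
    chomp $line;
    # Limit the split to 3 parts: only the first two fields are needed.
    my ($ele0, $ele1) = split /;/, $line, 3;
    my $key = "$ele0;$ele1;$file_date";
    # Keep the first occurrence of each key, skip later ones.
    push @rows, $key unless $seen{$key}++;
}
close($in);

print "$_\n" for @rows;
```

The point of doing the check inside the read loop is that each line is touched once, and no second pass over a multi-million-element array is needed.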
Re^3: how to speed up dupe checking of arrays by oha (Friar) on Jul 31, 2007 at 10:46 UTC
You already have them, as the keys of the hash: `$seen{"$ele0;$ele1;$file_date"}++;` Then if you need the data you can, for example, do: `foreach (keys %seen) { .... }` The value for each key is the number of times that string was seen. Oha
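A small sketch of the trade-off, using made-up keys: the hash alone gives the unique strings and their counts, but `keys` returns them in no particular order, so a separate array is still useful if the original order matters.

```perl
use strict;
use warnings;

# Made-up keys in the "$ele0;$ele1;$file_date" shape used in the thread.
my @input = ('b;2;20070731', 'a;1;20070731', 'b;2;20070731', 'c;3;20070731');

my (%seen, @ordered);
for my $key (@input) {
    # Post-increment returns the old count: 0 (false) the first time a key is seen.
    push @ordered, $key unless $seen{$key}++;
}

# Hash only: unique keys with repeat counts; order is not guaranteed, so sort for display.
printf "%s seen %d time(s)\n", $_, $seen{$_} for sort keys %seen;

# Array: unique keys in first-seen order.
print "$_\n" for @ordered;
```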
Re: how to speed up dupe checking of arrays by FunkyMonk (Bishop) on Jul 31, 2007 at 10:23 UTC
Either `@non_dupe_rows` is very poorly named, or `my @non_dupe_rows = do { my %seen;grep !$seen{$_}++, @rows };` isn't doing what you think it's doing. With 1 and 2 as the dups: `my @rows = qw/ 1 2 3 4 1 2 /; my @non_dupe_rows = do { my %seen;grep !$seen{$_}++, @rows }; print "@non_dupe_rows\n";` prints `1 2 3 4`. If you really want only the non-dup rows, try `my @rows = qw/ 1 2 3 4 1 2 /; my @non_dupe_rows = do { my %seen; $seen{$_}++ for @rows; grep $seen{$_} == 1, keys %seen }; print "@non_dupe_rows\n";`
Re^2: how to speed up dupe checking of arrays by ultibuzz (Monk) on Jul 31, 2007 at 10:43 UTC
I want non-dupes, but I want to keep one of each dupe, so 1 2 3 4 is what I want ;D kd ultibuzz
Re: how to speed up dupe checking of arrays by mjscott2702 (Pilgrim) on Jul 31, 2007 at 12:35 UTC
The `reduce` function in List::Util may make this faster. I haven't benchmarked it (I'm getting some weird benchmark results on my Win32/Cygwin install), but these calls are supposed to be REALLY fast.
Re^2: how to speed up dupe checking of arrays by mjscott2702 (Pilgrim) on Jul 31, 2007 at 12:40 UTC
Update: List::MoreUtils actually has a `uniq` function that does exactly what you need. This may be faster?
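As a rough sketch of what that looks like: `uniq` keeps the first occurrence of each value and preserves order. The example below imports `uniq` from the core module List::Util (available there since version 1.45) only so that it runs without installing anything; that is a substitution made here for convenience, and `List::MoreUtils::uniq` is called the same way for a plain list like this.

```perl
use strict;
use warnings;
use List::Util 1.45 qw(uniq);    # substitution; with the CPAN module: use List::MoreUtils qw(uniq);

my @rows   = qw/ 1 2 3 4 1 2 /;  # 1 and 2 appear twice
my @unique = uniq @rows;         # first occurrence of each value, original order kept

print "@unique\n";               # prints: 1 2 3 4
```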
Re^3: how to speed up dupe checking of arrays by ultibuzz (Monk) on Jul 31, 2007 at 13:21 UTC
List::MoreUtils is faster than my attempt, but slower than doing the dupe check inside the while loop. The problem might be the extra pass through each element of the array, which becomes the main time consumer. As I see it, it doesn't matter much what you use when you have fairly small arrays, but when you have arrays with several million elements, all the saved milliseconds count ;) kd ultibuzz
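One way to check claims like this on your own machine is the core Benchmark module. The sketch below is only an illustration under assumptions: the data is synthetic, the three sub names are invented here, and `uniq` comes from List::Util rather than List::MoreUtils so that nothing needs installing. It compares in-memory passes only, not the read-and-filter loop, and the timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util 1.45 qw(uniq);

# Synthetic data: 50_000 strings with 5_000 distinct values (made up for this sketch).
my @rows = map { 'key' . ($_ % 5_000) } 1 .. 50_000;

# Three hypothetical helpers that all keep the first occurrence of each value.
sub dedupe_grep { my %seen; return grep { !$seen{$_}++ } @rows }
sub dedupe_loop {
    my (%seen, @out);
    for (@rows) { push @out, $_ unless $seen{$_}++ }
    return @out;
}
sub dedupe_uniq { return uniq @rows }

# Run each variant 20 times and print a comparison table.
cmpthese(20, {
    grep_seen => sub { my @u = dedupe_grep() },
    loop_seen => sub { my @u = dedupe_loop() },
    list_uniq => sub { my @u = dedupe_uniq() },
});
```

With such a small iteration count Benchmark may warn that the result is unreliable; raising the count or the data size gives steadier numbers.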
Re: how to speed up dupe checking of arrays by radiantmatrix (Parson) on Jul 31, 2007 at 14:45 UTC
Just a thought: could a DB be leveraged here? `LOAD DATA INFILE` is pretty fast, and `SELECT DISTINCT` is both easy to code and pretty quick... <–radiant.matrix–>
Re^2: how to speed up dupe checking of arrays by ultibuzz (Monk) on Jul 31, 2007 at 14:58 UTC
We have an Oracle 10g on an HP Superdome. Loading all the data with SQL*Loader and direct=true as an option takes longer than dupe checking with Perl and then loading the filtered data. Loading the data into the DB with Perl, or with normal SQL*Loader without tuning, would take ages. With direct=true SQL*Loader pumps in 10 million rows in less than 20 seconds; without direct=true it takes 10+ minutes, because Oracle sets commit points and checks the data for correctness ^^. The other point is that we need several indexes and partition groups on it, which would take hours to create if we used the unfiltered data. kd ultibuzz