Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

SCENARIO: We have a huge set of log files (about 50k). I am trying to create a user list for each URL.

MY PROBLEM: How do I remove duplicate values (not keys) from a hash of arrays like this: @{$is{$a}}

MY IDEA: I have already tried this:  @{$is{$a}} = grep { !$temp{$_}++ } @{$is{$a}};

THE PROBLEM: While the above does seem to work for most key/value pairs, some duplicate values still sneak in.

I am running out of ideas about what to do here. Please give me any suggestions.

Thank you

Edit: g0n - code tags & formatting

Re: remove duplicates
by ayrnieu (Beadle) on Mar 11, 2006 at 07:07 UTC
    %a = reverse %a; %a = reverse %a;
Re: remove duplicates
by Corion (Patriarch) on Mar 11, 2006 at 07:10 UTC

    If your values look duplicated, most likely there is something you don't see. Most likely it's whitespace at the end of one value. Inspect your hash through Data::Dumper, and concentrate on values that should be equal but aren't:

        use strict;
        use Data::Dumper;

        my %hash = (
            'Hello ' => 'world',
            'The'    => 'world ',
        );
        print Dumper \%hash;

    If your hash is too large to conveniently dump, you can pick one of the duplicated values and copy just the matching entries into a smaller hash:

        my %small_hash;
        for ( keys %bad_hash ) {
            if ( $bad_hash{$_} =~ /orl/ ) {    # because we're looking for "world"
                $small_hash{$_} = $bad_hash{$_};
            }
        }
        print Dumper \%small_hash;

    You also might run into encoding issues where two different octet sequences (that actually compare as unequal) render as the same glyph sequence. But as Perl 5.8 internally uses UTF-8, that shouldn't be a problem. In any case, it would help to see some small code and a really small dataset (2 lines) that reproduces the problem.

    Update: Realized this is about values, not keys.

      Hi again. I tried to clean up the data as suggested, but isn't there some other way of making sure a key doesn't have duplicate values (especially when the logs you are reading number more than 25k)? Please help.

        I'm not sure I understand where your problem lies. A hash is the traditional way in Perl to check for duplicates. If you run out of memory because you have too many distinct entries, you can use DB_File or any other tied hash that stores its data on disk instead of in memory.

        If all that fails, you can always use a database.