Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

SCENARIO: We have a huge set of log files (about 50k). I am trying to create a user list for each URL.

MY PROBLEM: How do I remove duplicate values (not keys) from a hash of arrays like this: @{$is{$a}}

MY IDEA: I have already tried this:  @{$is{$a}} = grep { !$temp{$_}++ } @{$is{$a}};

THE PROBLEM: While the above does seem to work for most key/value pairs, some duplicate values still sneak in.

I am running out of ideas about what to do here. Please give me any suggestions.

Thank you

Edit: g0n - code tags & formatting

Re: remove duplicates
by ayrnieu (Beadle) on Mar 11, 2006 at 07:07 UTC
    %a = reverse %a; %a = reverse %a;
Re: remove duplicates
by Corion (Patriarch) on Mar 11, 2006 at 07:10 UTC

    If your values look duplicated, most likely there is something you don't see. Most likely it's whitespace at the end of one value. Inspect your hash through Data::Dumper, and concentrate on values that should be equal but aren't:

        use strict;
        use Data::Dumper;

        my %hash = (
            'Hello ' => 'world',
            'The'    => 'world ',
        );
        print Dumper \%hash;

    If your hash is too large to conveniently dump, you can pick one of the duplicated values and copy just the matching entries into a smaller hash:

        my %small_hash;
        for ( keys %bad_hash ) {
            if ( $bad_hash{$_} =~ /orl/ ) {    # because we're looking for "world"
                $small_hash{$_} = $bad_hash{$_};
            }
        }
        print Dumper \%small_hash;

    You also might run into encoding issues where two different octet sequences (that actually compare as unequal) render as the same glyph sequence. But as Perl 5.8 internally uses UTF-8, that shouldn't be a problem. In any case, it would help to see some small code and a really small dataset (2 lines) that reproduces the problem.

    Update: Realized this is about values, not keys.

      Hi again. I tried to clean up the data as suggested, but isn't there some other way of making sure a key doesn't have duplicate values (especially when the logs you are reading number more than 25k)? Please help.

        I'm not sure I understand where your problem lies. A hash is the traditional way in Perl to check for duplicates. If you run out of memory because you have too many distinct entries, you can use DB_File or any other tied hash that stores its data on disk instead of in memory.

        If all that fails, you can always use a database.