Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Eliminate exact duplicates from array of hashes

by Anonymous Monk
on Oct 09, 2019 at 17:17 UTC ( [id://11107249]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello

Is there a standard (build-in) way in Perl to eliminate exact duplicates from an array of hashes. The following does not work:

use Data::Dumper; use List::MoreUtils qw(uniq); my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my @unique = uniq @test_data; print Dumper \@unique;

Replies are listed 'Best First'.
Re: Eliminate exact duplicates from array of hashes
by tybalt89 (Monsignor) on Oct 09, 2019 at 17:42 UTC
    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11107249 use warnings; use List::Util qw(uniq); use Data::Dump qw(dd pp); my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my @unique = map eval, uniq map { pp $_ } @test_data; dd \@unique;

    Outputs:

    [ { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, ]

      This seems to do exactly what I was looking for. Thank you.

Re: Eliminate exact duplicates from array of hashes
by haukex (Archbishop) on Oct 09, 2019 at 17:44 UTC

    It depends a bit on the data. Can the hashes contain any other nested data structures, or is it really just an array of plain hashes, where the only values are strings? If you could say with absolute certainty that the keys and values will never contain a certain string, such as "\0", in that case you could use that as a string to join the key/value pairs of the hashes for a string comparison. Another option might be to stringify the hashes with Data::Dumper (a core module) and compare those strings ($Data::Dumper::Sortkeys needs to be on), although I personally don't think that's a very clean solution. Another (untested) idea might be to serialize the hashrefs with Storable for a more compact representation than what Data::Dumper produces (Fletch's JSON solution and tybalt89's Data::Dump solution are the same kind of idea). And if you want to do it with absolutely no modules at all (which I wouldn't really recommend), then you'll have to code it up in Perl, looping over the hashes to compare them, stepping deeper into the data structure if necessary.

    So please let us know of some more details of your data structure.

Re: Eliminate exact duplicates from array of hashes
by LanX (Saint) on Oct 09, 2019 at 19:09 UTC
    I really like tybalt89's approach here: Re: Eliminate exact duplicates from array of hashes.

    Unfortunately it requires an eval step...

    Here a way to avoid it

    use strict; use warnings; use Data::Dump qw/pp dd/; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my %seen; my @unique = grep { not $seen{pp $_}++ } @test_data; #pp \%seen; pp \@unique;

    Sadly uniq doesn't offer to provide an optional block (analog to sort ) to emulate this behavior with code like

    uniq { pp $_ } @test_data

    Please be aware of possible side effects when having circular data.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Here a version with Data::Dumper which is core. (though Data::Dump is always my first installation)

      use strict; use warnings; use Data::Dumper; use Test::More; sub uniq_nds{ # uniqe nested data-structures my %seen; local $Data::Dumper::Sortkeys = 1; local $Data::Dumper::Terse = 1; grep { not $seen{Dumper $_}++ } @_; } my @test_data = ( { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 1, Tag2 => "a" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my $got = [ uniq_nds @test_data ]; my $expected = [ { Tag1 => 1, Tag2 => "a" }, { Tag1 => 1, Tag2 => "b" }, { Tag1 => 1, Tag2 => "c" }, { Tag1 => 2, Tag2 => "a" }, { Tag1 => 2, Tag2 => "d" }, { Tag1 => 3 }, { Tag1 => "sun", Tag2 => "a" }, ]; is_deeply( $got, $expected, 'uniq AoH' ) or diag Dumper $got; done_testing;

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Eliminate exact duplicates from array of hashes
by Fletch (Bishop) on Oct 09, 2019 at 17:38 UTC

    It does work, it's just that you don't understand what it's comparing and why they look different to uniq. Each element in your @test_data is a separate hashref which will compare differently with another hashref which might have identical contents.

    For this particular usage with this (relatively small) amount of data you might could kludge something up by serializing items and then checking the serialized version.

    Update: Ooops, forgot to enable sorting of keys with canonical(1). Updated output as well. I did say it was a kludge . . .

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Interesting approach, thank you. However, it does not seem to work. It eliminates only 1 duplicate, while there are much more exact duplicates in my array (I guess it has to do with the order of the couples key/values inside the hash).

Re: Eliminate exact duplicates from array of hashes
by NetWallah (Canon) on Oct 09, 2019 at 19:22 UTC
    Using core modules only, and assuming a flat hashref (i.e. not nested),
    The following should be faster than other methods that do external serializing:
    my %unique = map { my @k=sort(keys %$_); join("",@k, @$_{@k}) => $_ } @test_data; print Dumper [values %unique];

                    "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss

      really?
      use strict; use warnings; use Data::Dumper; my @test_data = ( { a=>1, b=>2 }, { ab => 12} ); my %unique = map { my @k=sort(keys %$_); join("",@k, @$_{@k}) => $_ } @test_data; print Dumper [values %unique];

      $VAR1 = [ { 'ab' => 12 } ];

      updates

      • replacing "" in join($; ,@k, @$_{@k}) solves this issue *
      • please note that other solutions kept the order
      *) in most cases

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        OK - more robust version:
        use strict; use warnings; use Digest::MD5; use Data::Dumper; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my @test2=( { a=>1, b=>2 }, { ab => 12} ); # for performance, MD5 is fastest. my %unique = map {my @k=sort(keys %$_); my $ctx=Digest::MD5->new() or die "Cannot make MD5 o +bj"; $ctx->add($_ . $;) for @k; $ctx->add("=>"); #separator for values $ctx->add($_ . "+") for @$_{@k}; $ctx->digest() => $_ } @test_data; print Dumper [values %unique]; %unique = map {my @k=sort(keys %$_); my $ctx=Digest::MD5->new() or die "Cannot make MD5 ob +j"; $ctx->add($_ . $;) for @k; $ctx->add("=>"); #separator for values $ctx->add($_ . "+") for @$_{@k}; $ctx->digest() => $_ } @test2; print Dumper [values %unique];

                        "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss

Re: Eliminate exact duplicates from array of hashes
by johngg (Canon) on Oct 09, 2019 at 22:29 UTC

    A serialization approach:-

    use 5.026; use warnings; use Data::Dumper; my @test_data = ( { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{1}, Tag2 => q{b} }, { Tag1 => q{1}, Tag2 => q{c} }, { Tag1 => q{1}, Tag2 => q{a} }, { Tag1 => q{2}, Tag2 => q{a} }, { Tag1 => q{2}, Tag2 => q{d} }, { Tag1 => q{2}, Tag2 => q{a} }, { Tag1 => q{3} }, { Tag1 => q{sun}, Tag2 => q{a} }, { Tag1 => q{sun}, Tag2 => q{a} }, ); my @unique = do { my %seen; map { $_->[ 1 ] } grep { ! $seen{ $_->[ 0 ] } ++ } map { my $rhItem = $_; [ ( join qq{\x00}, map { join qq{\x00}, $_, $rhItem->{ $_ } } sort keys %{ $rhItem } ), $rhItem ] } @test_data; }; print Data::Dumper ->new( [ \ @unique ], [ qw{ *unique } ] ) ->Sortkeys( 1 ) ->Dumpxs();

    The output:-

    @unique = ( { 'Tag1' => '1', 'Tag2' => 'a' }, { 'Tag1' => '1', 'Tag2' => 'b' }, { 'Tag1' => '1', 'Tag2' => 'c' }, { 'Tag1' => '2', 'Tag2' => 'a' }, { 'Tag1' => '2', 'Tag2' => 'd' }, { 'Tag1' => '3' }, { 'Tag1' => 'sun', 'Tag2' => 'a' } );

    I hope this is of interest.

    Cheers,

    JohnGG

      This algorithm suffers from the same issue that LanX pointed out.

      The following data defeats the de-dup:

      my $null = qq{\x00}; my @test_data = ( { "a${null}1${null}b"=>"2" }, { a => 1, b => 2} );
      My second program gives the correct results, although it too could be defeated by sufficiently crafted data.

                      "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss

Re: Eliminate exact duplicates from array of hashes
by Anonymous Monk on Oct 09, 2019 at 17:59 UTC

    Thank you for the feedback (and for confirming there is not a standard/build-in solution to do this). My @test_data should contain only hashes as in my example (and probably with a bit of cleaning in the pre-process) always with the two (identical) keys. The values of the hashes should any digit/literal (words and so on), possibly containing also characters such as ,|'". (unicode strings). The size of @test_data can be big (100.000 hashes), but performance (time) is not an issue in this context.

      Now knowing more about the problem i might try

      use strict; use warnings; use Data::Dumper; my @test_data = ( { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "1", Tag2 => "b" }, { Tag1 => "1", Tag2 => "c" }, { Tag1 => "1", Tag2 => "a" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "2", Tag2 => "d" }, { Tag1 => "2", Tag2 => "a" }, { Tag1 => "3"}, { Tag1 => "sun", Tag2 => "a" }, { Tag1 => "sun", Tag2 => "a" }, ); my %found; my @unique; for my $grp (@test_data) { my $t1=$grp->{Tag1}//''; my $t2=$grp->{Tag2}//''; next if ($found{$t1}{$t2}); push @unique,$grp; $found{$t1}{$t2}=1; } print Dumper \@unique;
      as it saves the cost of serializing and the cost of repeated storage of the characters Tag1/Tag2 at the expense of only checking the two keys AND representing the undefined key value as a zero length character scalar.

      $VAR1 = [ { 'Tag1' => '1', 'Tag2' => 'a' }, { 'Tag2' => 'b', 'Tag1' => '1' }, { 'Tag1' => '1', 'Tag2' => 'c' }, { 'Tag1' => '2', 'Tag2' => 'a' }, { 'Tag2' => 'd', 'Tag1' => '2' }, { 'Tag1' => '3' }, { 'Tag1' => 'sun', 'Tag2' => 'a' } ];

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11107249]
Approved by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (4)
As of 2024-03-29 10:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found