DunLidjun has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to remove duplicates from a HoH. I've almost got it, I think, but I'm missing some key element. Any chance you can spot my mistake? I've been at it two days.

The data is moved into a HoH using the HoH example in "Programming Perl, 3rd Ed". I've included some of the offending data (pulled from an array in memory) and listed the hash import code as well as the duplicate removal section.

I'm trying to remove duplicate partnums from the data and combine the tags.

#!/usr/bin/perl
#
#use strict;
use warnings;
use File::Find;
use List::Compare;
use Data::Dumper;
use feature "switch";

## @tmp2 is the data given in the code snippet below
my $who=();
my $field=();
foreach (@tmp2) {
    s/^(.*?):\s*//;
    $who=$1;
    $rec = {};
    $HoH{$who}=$rec;
    for $field (split(/,/)) {
        ($key,$value) = split /=/, $field;
        $rec->{$key} =$value;
    }
}

## Testing to see if the hash is working.
## Uncomment Section below to test.
my $iter=();
my $iter2=();
for $iter (keys %HoH) {
    print "$iter: ";
    for $iter2 ( keys %{ $HoH{$iter}} ) {
        print "$iter2=$HoH{$iter}{$iter2} ,";
    }
    print "\n";
}

##############################################
##
## Remove duplicate part numbers and combine the tags, if necessary.
##
##
print "######################################";
print "Duplicate Removal Testing\n";
my %tmpHoH=%HoH;
my $keyouter;
my $valueouter;
my $keydup;
my $valuedup;
my @keydelete=();
while ( ($keyouter,$valueouter)= each %HoH){
    while ( ($keydup,$valuedup) = each %tmpHoH){
        if($keyouter eq $keydup){
            next;
        }
        if($HoH{$keyouter}{partnum} eq $tmpHoH{$keydup}{partnum}){
            print "**********\n";
            print "key: $keyouter, value: $HoH{$keyouter}{partnum}, tag= $HoH{$keyouter}{tags}\n";
            print "key: $keydup, value: $tmpHoH{$keydup}{partnum}, tag= $tmpHoH{$keydup}{tags}\n";
            $HoH{$keyouter}{tags}= $HoH{$keyouter}{tags} ." ". $tmpHoH{$keydup}{tags};
            print "$tmpHoH{$keyouter}{tags}\n";
            print "**********\n";
            push(@keydelete,"$keyouter=>$keydup");
        }
    }
}
print @keydelete;

The Data:

413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412
414: partnum=2202261000,description=THERMOWELL,quantity=2.0000,tags=
415: partnum=2201176000,description=THERMOMETER,quantity=2.0000,tags=
581: partnum=2204227002,description=TEMP TRANSMITTER,quantity=1.0000,tags=TE/TT-102
582: partnum=2201176000,description=THERMOMETER,quantity=3.0000,tags=TI-100 TI-101 TI-200
576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400

Any help would be highly appreciated! Thank you, Shawn Way

Replies are listed 'Best First'.
Re: Removing Duplicates in a HoH
by toolic (Bishop) on Dec 20, 2010 at 14:25 UTC
    It might make it a lot easier for others to help you if you provide a self-contained example for others to run to duplicate the problem you are having. You should populate your @tmp2 array for us. Then show the output you get along with the output you expect.

      Duly noted. Thanks for the tip on perltidy. I'd never heard of that script before.

      What I am looking for is to remove the duplicate part numbers and combine the tags (and add the quantities) similar to below:

      413: quantity=2.0000 ,description=PRESS GAUGE ,tags=PI-412 PI-400 ,partnum=2204133000 ,
      414: quantity=2.0000 ,description=THERMOWELL ,tags= ,partnum=2202261000 ,
      415: quantity=5.0000 ,description=THERMOMETER ,tags= TI-100 TI-101 TI-200 ,partnum=2201176000 ,

      The original information is below:

      413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412 ,
      414: partnum=2202261000,description=THERMOWELL,quantity=2.0000,tags= ,
      415: partnum=2201176000,description=THERMOMETER,quantity=2.0000,tags= ,
      576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400 ,
      582: partnum=2201176000,description=THERMOMETER,quantity=3.0000,tags=TI-100 TI-101 TI-200

      The code is below:

      #!/usr/bin/perl
      #
      use strict;
      use warnings;

      my @tmp2 = (
          "413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412",
          "414: partnum=2202261000,description=THERMOWELL,quantity=2.0000,tags=",
          "415: partnum=2201176000,description=THERMOMETER,quantity=2.0000,tags=",
          "576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400",
          "582: partnum=2201176000,description=THERMOMETER,quantity=3.0000,tags=TI-100 TI-101 TI-200"
      );

      #
      # Move the data into a hash %HoH
      #
      my $who   = ();
      my $field = ();
      my %HoH   = ();
      my $rec   = ();
      my $key   = ();
      my $value = ();
      foreach (@tmp2) {
          s/^(.*?):\s*//;
          $who       = $1;
          $rec       = {};
          $HoH{$who} = $rec;
          for $field ( split(/,/) ) {
              ( $key, $value ) = split /=/, $field;
              $rec->{$key} = $value;
          }
      }
      print "\nHash complete\n";

      ##############################################
      ##
      ## Remove duplicate part numbers and combine the tags, if necessary.
      ##
      ##
      print "######################################";
      print "Duplicate Removal Testing\n";
      my %tmpHoH = %HoH;
      my $keyouter;
      my $valueouter;
      my $keydup;
      my $valuedup;
      my @keydelete = ();
      while ( ( $keyouter, $valueouter ) = each %HoH ) {
          while ( ( $keydup, $valuedup ) = each %tmpHoH ) {
              if ( $keyouter eq $keydup ) {
                  next;
              }
              if ( $HoH{$keyouter}{partnum} eq $tmpHoH{$keydup}{partnum} ) {
                  print "**********\n";
                  print "key: $keyouter, value: $HoH{$keyouter}{partnum}, tag= $HoH{$keyouter}{tags}\n";
                  print "key: $keydup, value: $tmpHoH{$keydup}{partnum}, tag= $tmpHoH{$keydup}{tags}\n";
                  $HoH{$keyouter}{tags} =
                    $HoH{$keyouter}{tags} . " " . $tmpHoH{$keydup}{tags};
                  print "$tmpHoH{$keyouter}{tags}\n";
                  print "**********\n";
                  push( @keydelete, "$keyouter=>$keydup" );
              }
          }
      }
      print "######################################\n";

      ##############################################
      #############################################
      #
      ## Testing to see if the hash is working.
      ## Uncomment Section below to test.
      my $iter  = ();
      my $iter2 = ();
      for $iter ( keys %HoH ) {
          print "$iter: ";
          for $iter2 ( keys %{ $HoH{$iter} } ) {
              print "$iter2=$HoH{$iter}{$iter2} ,";
          }
          print "\n";
      }
Re: Removing Duplicates in a HoH
by scorpio17 (Canon) on Dec 20, 2010 at 15:09 UTC

    Here's how I'd do it:

    use strict;
    use Data::Dumper;

    my $delimiter = "|";    # delimiter for multiple tags
    my %data;

    while (my $line = <DATA>) {
        chomp $line;
        $line =~ s/^(.*?):\s*//;    # remove leading number and colon, and any whitespace
        my $rec = {};
        my $partno;
        for my $field ( split(/,/, $line) ) {
            my ($key, $value) = split(/=/, $field);
            if ($key eq 'partnum') {
                $partno = $value;
            }
            else {
                $rec->{$key} = $value;
            }
        }
        if ( defined $data{ $partno } ) {
            if ( $data{$partno}{'tags'} ) {
                $data{$partno}{'tags'} .= $delimiter . $rec->{'tags'};
            }
            else {
                $data{$partno}{'tags'} = $rec->{'tags'};
            }
            $data{$partno}{'quantity'} += $rec->{'quantity'};
            unless ($data{$partno}{'description'} eq $rec->{'description'}) {
                warn "Multiple descriptions for $partno ! \n";
            }
        }
        else {
            $data{ $partno } = $rec;
        }
    }

    print Dumper(\%data), "\n";

    __DATA__
    413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412
    414: partnum=2202261000,description=THERMOWELL,quantity=2.0000,tags=
    415: partnum=2201176000,description=THERMOMETER,quantity=2.0000,tags=
    581: partnum=2204227002,description=TEMP TRANSMITTER,quantity=1.0000,tags=TE/TT-102
    582: partnum=2201176000,description=THERMOMETER,quantity=3.0000,tags=TI-100 TI-101 TI-200
    576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400

    Notes:

    • I use the part number as the main hash key.
    • I use 'defined' to check for duplicates.
    • If I find a duplicate, I append its tags when the existing entry already has some; otherwise I copy them over. (It would be better to keep the tags in an array, to avoid having to split this list later.)
    • I separate tags using the character in $delimiter, so you can make it whatever you like.
    • If I find a duplicate, I combine the quantities, and sanity check the descriptions (should be the same?) - consider this optional.

    I get this output:

    $VAR1 = {
        '2201176000' => {
            'quantity' => 5,
            'description' => 'THERMOMETER',
            'tags' => 'TI-100 TI-101 TI-200'
        },
        '2204133000' => {
            'quantity' => 2,
            'description' => 'PRESS GAUGE',
            'tags' => 'PI-412|PI-400'
        },
        '2202261000' => {
            'quantity' => '2.0000',
            'description' => 'THERMOWELL',
            'tags' => ''
        },
        '2204227002' => {
            'quantity' => '1.0000',
            'description' => 'TEMP TRANSMITTER',
            'tags' => 'TE/TT-102'
        }
    };
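
The note above about keeping the tags in an array can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code posted in this reply: `merge_records` is a hypothetical helper name, the sample lines are a subset of the thread's data, and tags are assumed to be whitespace-separated within a record.

```perl
use strict;
use warnings;

# A subset of the records from the thread.
my @lines = (
    '413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412',
    '415: partnum=2201176000,description=THERMOMETER,quantity=2.0000,tags=',
    '582: partnum=2201176000,description=THERMOMETER,quantity=3.0000,tags=TI-100 TI-101 TI-200',
    '576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400',
);

# merge_records is a hypothetical name, not something from the thread.
# It returns a hash ref keyed by part number, with tags held in an array ref.
sub merge_records {
    my @records = @_;
    my %data;
    for my $line (@records) {
        ( my $fields = $line ) =~ s/^.*?:\s*//;    # drop the leading "413: "
        my %rec;
        for my $field ( split /,/, $fields ) {
            # A limit of 2 keeps an empty value ("tags=") as '' instead of undef.
            my ( $key, $value ) = split /=/, $field, 2;
            $rec{$key} = $value;
        }
        # Create the entry the first time this part number is seen.
        my $entry = $data{ $rec{partnum} } //= {
            description => $rec{description},
            quantity    => 0,
            tags        => [],
        };
        $entry->{quantity} += $rec{quantity};
        # Assumption: tags inside one record are separated by whitespace.
        push @{ $entry->{tags} }, split ' ', $rec{tags};
    }
    return \%data;
}

my $data = merge_records(@lines);
for my $partnum ( sort keys %$data ) {
    my $entry = $data->{$partnum};
    printf "%s: description=%s,quantity=%.4f,tags=%s\n",
        $partnum, $entry->{description}, $entry->{quantity},
        join( ' ', @{ $entry->{tags} } );
}
```

With the tags in an array, the delimiter only matters at print time, so there is nothing to split again later.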
Re: Removing Duplicates in a HoH
by state-o-dis-array (Hermit) on Dec 20, 2010 at 14:50 UTC
    It's not clear what you intend the code to actually do, but perhaps the issue isn't how to remove "duplicate" keys. Perhaps what you really ought to consider is whether there is a better way to store the data in the first place, so that you don't need to remove the duplicate keys. I notice that THERMOMETER has two different values for quantity. Is arbitrarily removing one of the thermometer entries what you really want? I don't know, since I don't know if the quantity data is important.

    Anyway, my main point is that it might be better to store your data by part number.

    $HoH{$part_number}{quantity}    = $qty;
    $HoH{$part_number}{description} = $description;
    ...

      I would normally store the information by part number; however, the file that produces this information is extremely large and has duplicate part numbers. This script is actually trying to combine and reduce the data to single part numbers, as well as group the tags and add the quantities.

        Storing the information by part number does what you are looking to accomplish. See the response of scorpio17, which provides an example of what I'm talking about.
Re: Removing Duplicates in a HoH
by hbm (Hermit) on Dec 20, 2010 at 14:38 UTC

    As you are building the HoH, why not store partnums in a temporary hash, and skip adding to the HoH any record whose partnum has already been seen?
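
A minimal sketch of that idea, assuming the same line format as the question. `%seen` is an illustrative name for the temporary hash, and the sample lines are a subset of the thread's data. Note that this keeps only the first record for each partnum; it does not combine tags or add quantities, so it fits only if the later duplicates can simply be dropped.

```perl
use strict;
use warnings;

my @tmp2 = (
    '413: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-412',
    '414: partnum=2202261000,description=THERMOWELL,quantity=2.0000,tags=',
    '576: partnum=2204133000,description=PRESS GAUGE,quantity=1.0000,tags=PI-400',
);

my %HoH;
my %seen;    # partnum => count of records seen so far (illustrative name)

for my $line (@tmp2) {
    # Split "413: fields..." into the record key and the field list.
    my ( $who, $fields ) = $line =~ /^(.*?):\s*(.*)$/ or next;

    my %rec;
    for my $field ( split /,/, $fields ) {
        my ( $key, $value ) = split /=/, $field, 2;
        $rec{$key} = $value;
    }

    # Post-increment returns the old count: 0 (false) the first time only.
    next if $seen{ $rec{partnum} }++;

    $HoH{$who} = \%rec;
}

print join( ',', sort keys %HoH ), "\n";    # prints "413,414"
```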