tnyflmngs has asked for the wisdom of the Perl Monks concerning the following question:

Hey folks,

My database currently has just over 50,000 equipment records. A couple thousand of those are duplicated in some fashion or another, and I need to identify those records. All was going well, and I actually got the records identified using an array of arrays.

A little about my solution so far: I extracted all serial number / id field pairs, then removed all non-word characters and some other tokens that I knew would be a problem. What I got was a text file that had the clean serial number and all id fields on the same line, which is exactly what I wanted.

Then I got to thinking, "Man, it would be nice if I had the rest of the information so that I don't have to individually look these up." That sounded pretty easy, so I decided to use a hash of arrays of arrays. This is where it got dicey. My code:
    for my $row ( @{$serials}) {
        my $equ    = $$row[$equIndex];
        my $pmf    = $$row[$pmfIndex];
        my $pro    = $$row[$proIndex];
        my $serial = $$row[$serialIndex];
        my $usr    = $$row[$usrIndex];
        my $date   = $$row[$dateIndex];
        my $clean  = $$row[$cleanIndex];
        if ($duplicates{$clean}) {
            push (@{$duplicates{$clean}}, [$equ, $pmf, $pro, $serial, $usr, $date]);
        } else {
            %duplicates = ($clean => [$equ, $pmf, $pro, $serial, $usr, $date]);
        }
    }
This is the offending snippet. What this does is create a hash key from the clean serial. The first value is created as an array, which is fine, but when a duplicate comes around I push it on and it adds it to the end of the first array, inside the array. What I want is:
    clean1 -> [[equ info] -> [equ info]]
    clean2 -> [equ info]
    clean3 -> [[equ info] -> [equ info] -> [equ info]]
so I can then print everything with an outer array length greater than 1 to a file and only get the duplicates. What I am getting is
    clean1 -> [equ info, array]
    clean2 -> [equ info]
    clean3 -> [equ info, array, array]
I have tried pushing the values into arrays first. I tried using two arrays, pushing the values onto one and then pushing that array onto the other. What I do know is that I am making this much harder than it is, but I am stumped.

Replies are listed 'Best First'.
Re: hash of arrays of arrays
by philiprbrenan (Monk) on Aug 23, 2012 at 22:50 UTC
        for my $row ( @{$serials}) {
            my $equ    = $$row[$equIndex];
            my $pmf    = $$row[$pmfIndex];
            my $pro    = $$row[$proIndex];
            my $serial = $$row[$serialIndex];
            my $usr    = $$row[$usrIndex];
            my $date   = $$row[$dateIndex];
            my $clean  = $$row[$cleanIndex];
            # No if/else needed: push creates the inner array on first use
            push @{$duplicates{$clean}}, [$equ, $pmf, $pro, $serial, $usr, $date];
        }

    After the loop, check each hash entry: arrays with only one element are unique serials, and anything longer is a set of duplicates.
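
    That post-loop check could look roughly like the sketch below. The sample data, the @report array, and the tab-separated layout are made up for illustration; only the shape of %duplicates (clean serial => array of row arrays) comes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: clean serial => list of equipment rows.
my %duplicates = (
    SN100 => [ [ 'E1', 'pump' ] ],
    SN200 => [ [ 'E2', 'valve' ], [ 'E7', 'valve' ] ],
    SN300 => [ [ 'E3', 'motor' ], [ 'E4', 'motor' ], [ 'E9', 'motor' ] ],
);

# @{ ... } in numeric context is the length of the outer array,
# so this keeps only the serials that were seen more than once.
my @dup_serials = grep { @{ $duplicates{$_} } > 1 } sort keys %duplicates;

# Build one tab-separated line per duplicated row.
my @report;
for my $clean (@dup_serials) {
    for my $row ( @{ $duplicates{$clean} } ) {
        push @report, join( "\t", $clean, @$row );
    }
}

print "$_\n" for @report;
```

    To write to a file instead of the screen, the last line could print to a filehandle opened for writing.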

Re: hash of arrays of arrays
by Cristoforo (Curate) on Aug 23, 2012 at 23:06 UTC
    I believe philiprbrenan found your problem. Every time you found an entry with a new '$clean' key, you were wiping out the other entries in the hash with: %duplicates = ($clean => [$equ, $pmf, $pro, $serial, $usr, $date]);
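
    To see the difference in isolation, here is a small sketch with invented rows (the names @rows, %clobbered and %grouped are only for this example). Assigning a list to the whole hash replaces everything in it, while push through a dereference adds to one entry and leaves the rest alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up rows: [ id, clean serial ]
my @rows = ( [ 'a1', 'S1' ], [ 'b2', 'S2' ], [ 'c3', 'S1' ] );

# List assignment to %clobbered throws away all previous keys,
# so only the pair assigned last is left after the loop.
my %clobbered;
for my $row (@rows) {
    %clobbered = ( $row->[1] => [ $row->[0] ] );
}

# push autovivifies the inner array for a new key and appends
# to it for an existing key; other keys are untouched.
my %grouped;
for my $row (@rows) {
    push @{ $grouped{ $row->[1] } }, [ $row->[0] ];
}

printf "clobbered keys: %d\n", scalar keys %clobbered;
printf "grouped keys:   %d\n", scalar keys %grouped;
printf "rows under S1:  %d\n", scalar @{ $grouped{S1} };
```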

    Here is some code similar to your problem that prints the duplicates.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        my %duplicates;
        my $serials = [
            [qw/ foo bar serial1 /],
            [qw/ who now serial2 /],
            [qw/ one more serial2 /],
            [qw/ two end serial3 /]
        ];

        for my $row (@$serials) {
            my $clean = $row->[2];
            push @{$duplicates{$clean}}, $row;
        }

        print Dumper \%duplicates;

        for my $clean (keys %duplicates) {
            my $aref = $duplicates{$clean};
            if (@$aref > 1) { # if duplicates
                for my $row (@$aref) {
                    print "@$row\n";
                }
            }
        }
    Output:

        C:\Old_Data\perlp>perl t7.pl
        $VAR1 = {
                  'serial3' => [
                                 [
                                   'two',
                                   'end',
                                   'serial3'
                                 ]
                               ],
                  'serial1' => [
                                 [
                                   'foo',
                                   'bar',
                                   'serial1'
                                 ]
                               ],
                  'serial2' => [
                                 [
                                   'who',
                                   'now',
                                   'serial2'
                                 ],
                                 [
                                   'one',
                                   'more',
                                   'serial2'
                                 ]
                               ]
                };
        who now serial2
        one more serial2
    Chris

    Update: Don't know why I pulled $clean out. This version doesn't.

      Thanks both of you. I have seen people on here talking about Data::Dumper, guess I should make the time to mess with it. I will take a look at it in the morning. I guess I thought that...who knows what I thought, it seems silly as I say it out loud. For some reason I thought that I had to be careful not to clobber my hash, but it appears that is exactly what I tried to do.
        As another suggestion, this $$xx[blah] notation is confusing. Here $row is a reference. I prefer the arrow notation. This makes it more clear that $row is a reference.
            for my $row ( @{$serials}) {
                my $equ    = $row->[$equIndex];
                my $pmf    = $row->[$pmfIndex];
                my $pro    = $row->[$proIndex];
                my $serial = $row->[$serialIndex];
                my $usr    = $row->[$usrIndex];
                my $date   = $row->[$dateIndex];
                my $clean  = $row->[$cleanIndex];
                #blah.....
            }
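
        For comparison, a short sketch showing that the notations reach the same element. The sample row and variable names are invented for this example. The array slice at the end is one more option, not something suggested earlier in the thread, for pulling several fields out of a row at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample row; $row holds a reference to an array.
my $row = [ 'EQ-17', 'PMF-3', 'SN12345' ];

my $first  = $$row[2];      # the style used in the original post
my $second = ${$row}[2];    # same thing with explicit braces
my $third  = $row->[2];     # arrow notation

# Array slice through the reference: several fields in one statement.
my ( $equ, $serial ) = @{$row}[ 0, 2 ];

print "$first $second $third\n";
print "$equ $serial\n";
```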