hash of arrays of arrays

tnyflmngs has asked for the wisdom of the Perl Monks concerning the following question:

Hey folks, my database currently has just over 50,000 equipment records. A couple thousand of those are duplicated in one fashion or another, and I need to identify those records. All was going well, and I actually got the records identified using an array of arrays. A little about my solution so far: I extracted all serial number / id field pairs, then removed all non-word characters and some other tokens that I knew would be a problem. What I got was a text file that had the clean serial number and all id fields on the same line, which is exactly what I wanted. Then I got to thinking, "Man, it would be nice if I had the rest of the information so that I don't have to look these up individually." That sounded pretty easy, so I decided to use a hash of arrays of arrays, and this is where it got dicey. My code:

for my $row ( @{$serials}) {
    my $equ = $$row[$equIndex];
    my $pmf = $$row[$pmfIndex];
    my $pro = $$row[$proIndex];
    my $serial = $$row[$serialIndex];
    my $usr = $$row[$usrIndex];
    my $date = $$row[$dateIndex];
    my $clean = $$row[$cleanIndex];
    if ($duplicates{$clean}) {
        push (@{$duplicates{$clean}}, [$equ, $pmf, $pro, $serial, $usr, $date]);
    } else {
        %duplicates = ($clean => [$equ, $pmf, $pro, $serial, $usr, $date]);
    }
}

This is the offending snippet. What this does is create a hash key from the clean serial. The first value is created as an array, which is fine, but when a duplicate comes around I push it on and it adds it to the end of the first array, inside the array. What I want is:

clean1 -> [[equ info] -> [equ info]]
clean2 -> [equ info]
clean3 -> [[equ info] -> [equ info] -> [equ info]]

so I can then print everything with an outer array length greater than 1 to a file and only get the duplicates. What I am getting is

clean1 -> [equ info, array]
clean2 -> [equ info]
clean3 -> [equ info, array, array]

I have tried pushing the values into arrays first. I tried using two arrays and pushing the values onto 1 and then push that array onto another. What I do know is that I am making this much harder than it is, but I am stumped.


Re: hash of arrays of arrays by philiprbrenan (Monk) on Aug 23, 2012 at 22:50 UTC
for my $row ( @{$serials}) {
    my $equ = $$row[$equIndex];
    my $pmf = $$row[$pmfIndex];
    my $pro = $$row[$proIndex];
    my $serial = $$row[$serialIndex];
    my $usr = $$row[$usrIndex];
    my $date = $$row[$dateIndex];
    my $clean = $$row[$cleanIndex];
    push @{$duplicates{$clean}}, [$equ, $pmf, $pro, $serial, $usr, $date];
}

After the loop, check to see which hash entries have arrays with only one element; those serials are unique, and the rest are your duplicates.
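The check on the finished hash can be sketched as follows. This is only a minimal illustration: the contents of `%duplicates` are made-up sample rows, not data from the original post. It keeps the keys whose outer array holds more than one row, which are the duplicates the original poster wants to write to a file.

```perl
use strict;
use warnings;

# Hypothetical sample data: clean serial => list of rows (array of arrays).
my %duplicates = (
    clean1 => [ [ 'equ1', 'pmf1' ], [ 'equ2', 'pmf2' ] ],
    clean2 => [ [ 'equ3', 'pmf3' ] ],
    clean3 => [ [ 'equ4', 'pmf4' ], [ 'equ5', 'pmf5' ], [ 'equ6', 'pmf6' ] ],
);

# A dereferenced array in numeric context yields its element count,
# so "> 1" selects serials that appear on more than one record.
my @dup_keys = grep { @{ $duplicates{$_} } > 1 } sort keys %duplicates;

# Print one tab-separated line per duplicated record.
for my $clean (@dup_keys) {
    for my $row ( @{ $duplicates{$clean} } ) {
        print join( "\t", $clean, @$row ), "\n";
    }
}
```

Sorting the keys first is a choice made here so the report comes out in a stable order; it is not required for the duplicate check itself.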
Re: hash of arrays of arrays by Cristoforo (Curate) on Aug 23, 2012 at 23:06 UTC
I believe philiprbrenan found your problem. Every time you found an entry with a new `$clean` key, you were wiping out the other entries in the hash with:

%duplicates = ($clean => [$equ, $pmf, $pro, $serial, $usr, $date]);

Here is some code similar to your problem that prints the duplicates.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %duplicates;

my $serials = [
    [qw/ foo bar serial1 /],
    [qw/ who now serial2 /],
    [qw/ one more serial2 /],
    [qw/ two end serial3 /]
];

for my $row (@$serials) {
    my $clean = $row->[2];
    push @{$duplicates{$clean}}, $row;
}

print Dumper \%duplicates;

for my $clean (keys %duplicates) {
    my $aref = $duplicates{$clean};
    if (@$aref > 1) { # if duplicates
        for my $row (@$aref) {
            print "@$row\n";
        }
    }
}

Output:

C:\Old_Data\perlp>perl t7.pl
$VAR1 = {
          'serial3' => [
                         [
                           'two',
                           'end',
                           'serial3'
                         ]
                       ],
          'serial1' => [
                         [
                           'foo',
                           'bar',
                           'serial1'
                         ]
                       ],
          'serial2' => [
                         [
                           'who',
                           'now',
                           'serial2'
                         ],
                         [
                           'one',
                           'more',
                           'serial2'
                         ]
                       ]
        };
who now serial2
one more serial2

Chris

Update: Don't know why I pulled `$clean` out. This version doesn't.
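To make the diagnosis concrete, here is a small sketch that contrasts the original `else` branch with a single `push`. The rows in `@rows` and the names `%broken` and `%fixed` are invented for illustration. Besides wiping the hash, the original assignment stores the first record as a flat array, so a later `push` puts a nested array next to plain scalars, which appears to match the `[equ info, array]` shape reported in the question.

```perl
use strict;
use warnings;

# Hypothetical rows: [ equipment id, clean serial ].
my @rows = ( [ 'e1', 's1' ], [ 'e2', 's2' ], [ 'e3', 's2' ] );

# Original logic: list assignment to %broken replaces the whole hash,
# and the first record is stored as a flat array, not as a row in a list.
my %broken;
for my $row (@rows) {
    my ( $equ, $clean ) = @$row;
    if ( $broken{$clean} ) {
        push @{ $broken{$clean} }, [$equ];
    }
    else {
        %broken = ( $clean => [$equ] );
    }
}

# Fixed logic: push autovivifies the outer array, so every record,
# including the first one, is stored as its own row.
my %fixed;
for my $row (@rows) {
    my ( $equ, $clean ) = @$row;
    push @{ $fixed{$clean} }, [$equ];
}

printf "broken keys: %s\n", join ',', sort keys %broken;
printf "fixed keys:  %s\n", join ',', sort keys %fixed;
printf "first element of broken s2 is a plain scalar: %s\n",
    ref( $broken{s2}[0] ) ? 'no' : 'yes';
```

In the broken version the `s1` entry is gone by the end of the loop, while the fixed version keeps both keys and stores two rows under `s2`.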
Re^2: hash of arrays of arrays by tnyflmngs (Acolyte) on Aug 24, 2012 at 02:21 UTC
Thanks, both of you. I have seen people on here talking about Data::Dumper; I guess I should make the time to mess with it. I will take a look at it in the morning. I guess I thought that...who knows what I thought, it seems silly as I say it out loud. For some reason I thought that I had to be careful not to clobber my hash, but it appears that is exactly what I tried to do.
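For anyone else trying Data::Dumper for the first time, a minimal sketch of inspecting a hash of arrays of arrays might look like this. The sample data is invented; `$Data::Dumper::Sortkeys` and `$Data::Dumper::Indent` are the module's standard configuration variables, set here only to make the dump easier to compare between runs.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys so repeated runs print in the same order,
# and use the more compact indentation style.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Hypothetical hash of arrays of arrays, keyed by clean serial.
my %duplicates = (
    serial2 => [ [qw/ who now /], [qw/ one more /] ],
    serial1 => [ [qw/ foo bar /] ],
);

# Dumper takes a reference and returns the structure as a string.
my $dump = Dumper( \%duplicates );
print $dump;
```

Dumping the structure right after the loop that builds it is a quick way to see whether each key really points at a list of rows.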
Re^3: hash of arrays of arrays by Marshall (Canon) on Aug 24, 2012 at 11:41 UTC
As another suggestion, this `$$xx[blah]` notation is confusing. Here $row is a reference. I prefer the arrow notation. This makes it more clear that $row is a reference.

for my $row ( @{$serials}) {
    my $equ = $row->[$equIndex];
    my $pmf = $row->[$pmfIndex];
    my $pro = $row->[$proIndex];
    my $serial = $row->[$serialIndex];
    my $usr = $row->[$usrIndex];
    my $date = $row->[$dateIndex];
    my $clean = $row->[$cleanIndex];
    #blah.....
}
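Since every field comes from the same row, an array slice can fetch several of them in one statement. The following is only a sketch: the index values and the sample row are invented, because the real indexes are not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical column positions and one sample row.
my ( $equIndex, $serialIndex, $cleanIndex ) = ( 0, 1, 2 );
my $row = [ 'EQ-100', 'SN 12/34', 'SN1234' ];

# Single element through the reference, arrow style.
my $equ = $row->[$equIndex];

# Several elements at once with a slice of the referenced array.
my ( $serial, $clean ) = @{$row}[ $serialIndex, $cleanIndex ];

print "$equ $serial $clean\n";
```

The slice keeps the list of variables and the list of indexes side by side, which can make a mismatch between the two easier to spot.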