Finding Duplicates and Deleting in a Complex Data Structure

GuiPerl has asked for the wisdom of the Perl Monks concerning the following question:

I am building the following data structure from a flat file. The code is as follows:

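use strict;
use warnings;
use Text::CSV_XS;

my @Divisions = qw(ABER BERF CECC DADD);
my @rows;
my %AG;
my $Rec = {};
my %positions;

# new() takes a single hash reference; the encoding => "utf-8" pair was
# outside it, so the encoding is better set on the file handle.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });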

 
# CSV file with comma-delimited data, decoded as UTF-8
open my $fh1, "<:encoding(utf-8)", "test.csv" or die "test.csv: $!";

while (my $row = $csv->getline($fh1)) {
    # Both branches of the original if/else on $row->[12] pushed the row,
    # so every row is kept.
    push @rows, $row;
}
close $fh1 or die "test.csv: $!";



    
    
foreach my $rec (@rows) {
    foreach my $dept (@Divisions) {
        # The alternation is grouped so that ^ anchors both the A and the B grades
        if ($rec->[14] =~ /^$dept/ && $rec->[15] =~ /^(?:A\W[1-5]|B\W[1-2])/) {
            my $Rec = {
                SECTION  => $rec->[0],
                GRADE    => strip_hyphen($rec->[1]),
                POSITION => $rec->[2],
                NAME     => invert_name($rec->[3]),
                AGE      => convert_date($rec->[4]),
                GENDER   => $rec->[5],
            };

            # Group the records by the value in column 10
            push @{ $AG{ $rec->[10] } }, $Rec;
        }
    }
}
      
      

 
foreach my $A (sort keys %AG) {
    foreach my $p (@{ $AG{$A} }) {
        print $p->{'GRADE'}, " ", $p->{'NAME'}, " ", $p->{'POSITION'}, " ",
              $p->{'AGE'}, " ", $p->{'GENDER'}, "\n";
    }
}

Output of the sample data structure using Data::Dumper:



$VAR1 = 'ABER - Advanced Technologies';
$VAR2 = [
          {
            
            'NAME' => 'J. Green',
            'DATE_OF_BIRTH' => '8/18/1959',
            'SECTION' => 'ABER',
            'POSITION' => 'DIRECTOR',
            'AGE' => 55,
            'GRADE' => 'B2'
          }
        ];
$VAR3 = 'BERF - Satellite Research';
$VAR4 = [
          {

            'NAME' => 'P. Smith',
            'DATE_OF_BIRTH' => '12/11/1957',
            'SECTION' => 'BERF',
            'POSITION' => 'CHIEF',
            'AGE' => 56,
            'GRADE' => 'B1'
          },
          {
            
            'NAME' => 'R. Forest',
            'DATE_OF_BIRTH' => '1/18/1954',
            'SECTION' => 'BERF',
            'POSITION' => 'SENIOR OFFICER',
            'AGE' => '60 GREEN',
            'GRADE' => 'A5'
          },
          {
           
            'NAME' => 'R.Forest',
            'DATE_OF_BIRTH' => '03/09/1964',
            'SECTION' => 'BERF',
            'POSITION' => 'SENIOR OFFICER',
            'AGE' => 'Vacant',
            'GRADE' => 'A5'
          },
          {

            'NAME' => 'K. King',
            'DATE_OF_BIRTH' => '8/9/1960',
            'SECTION' => 'BERF',
            'POSITION' => 'SENIOR OFFICER',
            'AGE' => 54,
            'FEMALE' => '',
            'GRADE' => 'A5'
          },

        ];

What I need to do is count how many times each GRADE value (e.g. B1, A5, A4) occurs, and then delete the GRADE value from the repeated records, so that a grade appearing more than once is printed only for its first record.

Expected Output:

B2, J. Green, DIRECTOR,55,M
B1,P.Smith,CHIEF,54,M
A5,R.Forest,SENIOR OFFICER,60,M
   K.King,SENIOR OFFICER,54, M (A5 is excluded because it appears more than once)
   P.Turner,50, M (A5 is excluded because it appears more than once)

Any pointers would really be appreciated.


Replies are listed 'Best First'.
Re: Finding Duplicates and Deleting in a Complex Data Structure by hdb (Monsignor) on Sep 05, 2014 at 13:21 UTC
Instead of removing duplicate grades, you should just suppress the printing of them. If you first sort your data by grade and then skip printing a grade that repeats the previous one, you should get what you want. It could look like this (based on simplified data):

use strict;
use warnings;

my $data = [
    { 'NAME' => 'J. Green',  'GRADE' => 'B2' },
    { 'NAME' => 'P. Smith',  'GRADE' => 'B1' },
    { 'NAME' => 'R. Forest', 'GRADE' => 'A5' },
    { 'NAME' => 'R.Forest',  'GRADE' => 'A5' },
    { 'NAME' => 'K. King',   'GRADE' => 'A5' },
];

my $previous_grade = '';
for my $item ( sort { $a->{'GRADE'} cmp $b->{'GRADE'} } @$data ) {
    my( $grade, $name ) = ( $item->{'GRADE'}, $item->{'NAME'} );
    # Pad with spaces instead of repeating the grade
    print $grade eq $previous_grade ? ( ' ' x ( length( $grade )+1 ) ) : "$grade,";
    print "$name\n";
    $previous_grade = $grade;
}

gives you

A5,R. Forest
   R.Forest
   K. King
B1,P. Smith
B2,J. Green
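The same sort-and-suppress idea can be applied per division to a hash of arrays shaped like the %AG structure in the question. The following is only a sketch under assumptions: the sample records are cut down to NAME and GRADE, the helper name division_lines is hypothetical, and a tie-break on NAME is added so that the order within a grade does not depend on the sort being stable.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like %AG in the question:
# division name => array of record hashes (only NAME and GRADE kept here).
my %AG = (
    'ABER - Advanced Technologies' => [
        { NAME => 'J. Green', GRADE => 'B2' },
    ],
    'BERF - Satellite Research' => [
        { NAME => 'P. Smith',  GRADE => 'B1' },
        { NAME => 'R. Forest', GRADE => 'A5' },
        { NAME => 'K. King',   GRADE => 'A5' },
    ],
);

# Return the report lines for one division, showing each grade only once.
# (division_lines is an illustrative helper, not part of the original code.)
sub division_lines {
    my ($records) = @_;
    my @lines;
    my $previous_grade = '';

    # Sort by grade, then by name, so the order is deterministic.
    my @sorted = sort {
        $a->{GRADE} cmp $b->{GRADE} or $a->{NAME} cmp $b->{NAME}
    } @$records;

    for my $item (@sorted) {
        my $grade = $item->{GRADE};
        # Repeated grade: pad with spaces so the names stay aligned.
        my $prefix = $grade eq $previous_grade
                   ? ' ' x ( length($grade) + 1 )
                   : "$grade,";
        push @lines, $prefix . $item->{NAME};
        $previous_grade = $grade;
    }
    return @lines;
}

for my $division ( sort keys %AG ) {
    print "$division\n";
    print "$_\n" for division_lines( $AG{$division} );
}
```

Because $previous_grade is reset for every division, a grade that appears in two different divisions is still printed once in each of them.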
Re^2: Finding Duplicates and Deleting in a Complex Data Structure by GuiPerl (Acolyte) on Sep 05, 2014 at 15:09 UTC
Thanks a million. By the way, how would I count the number of B2s, A5s, etc.?
Re^3: Finding Duplicates and Deleting in a Complex Data Structure by hdb (Monsignor) on Sep 05, 2014 at 15:23 UTC
You could use a hash and count within the loop:

my $previous_grade = '';
my %grade_count;
for my $item ( sort { $a->{'GRADE'} cmp $b->{'GRADE'} } @$data ) {
    my( $grade, $name ) = ( $item->{'GRADE'}, $item->{'NAME'} );
    print $grade eq $previous_grade ? ( ' ' x ( length( $grade )+1 ) ) : "$grade,";
    print "$name\n";
    $previous_grade = $grade;
    # Tally how many records carry this grade
    $grade_count{$grade}++;
}
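If the GRADE key really has to be deleted from the repeated records, as the original question asks, one possible variation is to count in a first pass and delete in a second. This is a sketch only: it uses the simplified data from the reply above, the %seen hash is an addition for illustration, and the three-space padding assumes two-character grades.

```perl
use strict;
use warnings;

# Simplified sample data, as in the earlier reply.
my $data = [
    { NAME => 'J. Green',  GRADE => 'B2' },
    { NAME => 'P. Smith',  GRADE => 'B1' },
    { NAME => 'R. Forest', GRADE => 'A5' },
    { NAME => 'R.Forest',  GRADE => 'A5' },
    { NAME => 'K. King',   GRADE => 'A5' },
];

# Pass 1: count how often each grade occurs.
my %grade_count;
$grade_count{ $_->{GRADE} }++ for @$data;

# Pass 2: keep GRADE on the first record of each grade and delete it
# from every later record with the same grade.
my %seen;
for my $item (@$data) {
    my $grade = $item->{GRADE};
    delete $item->{GRADE} if $seen{$grade}++;
}

# Report the counts, then the records in their original order.
print "$_: $grade_count{$_}\n" for sort keys %grade_count;
for my $item (@$data) {
    # Three spaces stand in for a deleted two-character grade plus comma.
    my $prefix = exists $item->{GRADE} ? "$item->{GRADE}," : '   ';
    print $prefix, $item->{NAME}, "\n";
}
```

Unlike the sorted version, this keeps the records in input order, but it modifies the data in place, so the grade of a repeated record is no longer available afterwards.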