country1 has asked for the wisdom of the Perl Monks concerning the following question:

I have 3 CSV files which I am using as input. They do not necessarily contain identical server names, nor are they necessarily in the same sort order. Also, a particular input file may have more than one line with the same server name (see server wsomddvfxa01 in File A).
File A
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,93.75,95.87,66.67,68.13
wsomdavpra03,90.39,94,65.77,68.51
wsomddvfxa01,39.22,92.19,82.59,88.25
wsomddvfxa01,35.45,89.23,79.89,83.24

File B
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,34.78,100,55.1,67.6
wsomdavpra03,69.04,98.55,84.07,89.73
wsomddvfxa01,92.44,97.54,67.72,71.69
wsompapgtw05,48.77,96.9,92.1,93.55

File C
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,93.13,98.11,68.95,73.47
wsomdavpra03,68.85,97.56,76.35,98.23
wsomddvfxa01,46.97,96.29,88.23,94.02
wsompapgtw05,30.66,93.74,39.89,71.35

What I am trying to do is this: for each server (in column 1), get the Avg CPU (in column 2) from each of the 3 files. For the files above this would produce

File OUT
WSOMQAVPRA05,93.75,34.78,93.13
wsomdavpra03,90.39,69.04,68.85
wsomddvfxa01,39.22,92.44,46.97
wsompapgtw05,0,48.77,30.66

Notice that in File OUT, since server wsompapgtw05 does not appear in File A, its value is replaced with '0'. I was able to do this in my Perl code (i.e., place a '0' in the output file when a particular server name appears in at least one of the input files, but not in all of them).

I have not yet been able to accomplish the second desired task. Notice in File OUT that the values for server wsomddvfxa01 are taken from the first line in File A. I want my Perl code to pick out the Avg CPU value from the first data line when there is more than one line for a given server name in one of the input CSV files.
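For reference, the whole merge can be sketched in a few lines (a minimal sketch, not the poster's code; `merge_avg_cpu` is a made-up name, and it takes arrays of lines rather than file handles so it stays self-contained):

```perl
use strict;
use warnings;

# Merge the "Avg CPU" column (field 2) from several CSV inputs, keyed by
# server name (field 1). The first data line seen for a server in a given
# file wins; servers missing from a file get 0.
sub merge_avg_cpu {
    my @files = @_;    # each element: arrayref of CSV lines for one file
    my %usage;         # server name => [avg CPU per file]
    for my $i (0 .. $#files) {
        for my $line (@{ $files[$i] }) {
            my ($server, $avg) = split /,/, $line, 3;
            next unless defined $avg;      # ignore malformed lines
            next if $avg =~ /[^0-9.]/;     # skips the header line
            $usage{$server}[$i] //= $avg;  # only the first value sticks
        }
    }
    # pad the gaps with 0 so every server has one value per file
    for my $server (keys %usage) {
        $usage{$server}[$_] //= 0 for 0 .. $#files;
    }
    return \%usage;
}
```

The defined-or assignment (`//=`) needs perl 5.10 or later; `||=` behaves the same here as long as a genuine average of 0 never has to be preserved.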

My perl code looks as follows:

#!/usr/bin/perl
use strict;
use warnings;

################################################
#
# Create File with Avg CPU Numbers for each Server
# Name
#
###############################################

my %usage;
my $files = 0;

for my $file ("sfull1ns.dat", "sfull2ns.dat", "sfull3ns.dat") {
    open (my $fh, "<", $file) or die "Can't open file $file: $!";
    while (my $line = <$fh>) {
        chomp($line);
        my ($server, @data) = (split(",", $line));
        if ($data[0] lt "!") {
            $data[0] = 0;
        }
        next if grep /[^0-9.]/, @data;
        $usage{$server} = [] unless exists $usage{$server};
        push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
        push @{$usage{$server}}, $data[0];
    }
    continue {
        $files++ if eof;
    }
    close $fh or die "Can't close file $file: $!";

Code tags added by GrandFather

Replies are listed 'Best First'.
Re: Eliminating Duplicate Lines From A CSV File
by GrandFather (Saint) on Jul 24, 2007 at 21:58 UTC
    use strict;
    use warnings;

    my %usage;

    my @sfull1ns = split /\n/, <<'FILE';
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,93.75,95.87,66.67,68.13
    wsomdavpra03,90.39,94,65.77,68.51
    wsomddvfxa01,39.22,92.19,82.59,88.25
    wsomddvfxa01,35.45,89.23,79.89,83.24
    FILE
    my @sfull2ns = split /\n/, <<FILE;
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,34.78,100,55.1,67.6
    wsomdavpra03,69.04,98.55,84.07,89.73
    wsomddvfxa01,92.44,97.54,67.72,71.69
    wsompapgtw05,48.77,96.9,92.1,93.55
    FILE
    my @sfull3ns = split /\n/, <<FILE;
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,93.13,98.11,68.95,73.47
    wsomdavpra03,68.85,97.56,76.35,98.23
    wsomddvfxa01,46.97,96.29,88.23,94.02
    wsompapgtw05,30.66,93.74,39.89,71.35
    FILE

    for my $fileData (['File A', \@sfull1ns], ['File B', \@sfull2ns], ['File C', \@sfull3ns],) {
        my ($filename, $data) = @$fileData;
        shift @$data;
        for my $line (@$data) {
            chomp($line);
            my ( $server, @data ) = ( split( ",", $line ) );
            $usage{$server}{$filename}{value} ||= $data[0];
        }
    }

    for my $file ('File A', 'File B', 'File C') {
        $usage{$_}{$file}{value} ||= 0 for keys %usage;
    }

    for my $server (sort keys %usage) {
        print "$server,",
            join (',', map {$usage{$server}{$_}{value}} sort keys %{$usage{$server}}),
            "\n";
    }

    Prints:

    WSOMQAVPRA05,93.75,34.78,93.13
    wsomdavpra03,90.39,69.04,68.85
    wsomddvfxa01,39.22,92.44,46.97
    wsompapgtw05,0,48.77,30.66

    Reverting to file-based code should be fairly obvious. Keeping the lines in arrays makes the code stand alone.

    Update: Remove averaging code for multiple file entries for a server and use first entry value instead.


    DWIM is Perl's answer to Gödel
      You actually computed the average in case there were multiple averages in a file, like I was tempted to do. But if you read carefully, the original poster wanted the first number in case there were multiple entries for the same server in one file.

      GrandFather,


      The CSV files I am using for input are actually much larger (1600+ records each). How can I modify your code to read the 3 input files ("sfull1ns.dat", "sfull2ns.dat", and "sfull3ns.dat") from disk rather than in-stream?

        Is this what you want?

        use strict; use warnings;
        my %usage;
        my @files = qw(sfull1ns.dat sfull2ns.dat sfull3ns.dat);

        for my $file ( @files ) {
            open( my $fh, "<", $file ) or die "Can't open file $file: $!";
            <$fh>;    # Skip header line
            while ( my $line = <$fh> ) {
                chomp($line);
                my ( $server, @data ) = ( split( ",", $line ) );
                $usage{$server}{$file}{value} ||= $data[0];
            }
        }

        for my $file (@files) {
            $usage{$_}{$file}{value} ||= 0 for keys %usage;
        }

        for my $server (sort keys %usage) {
            print "$server,",
                join (',', map {$usage{$server}{$_}{value}} sort keys %{$usage{$server}}),
                "\n";
        }

        DWIM is Perl's answer to Gödel
Re: Eliminating Duplicate Lines From A CSV File
by b4swine (Pilgrim) on Jul 24, 2007 at 22:49 UTC
    The code as you have written it does not have balanced { braces }, so I can't see how far you got. Nor is it indented in a way that makes clear which loop the continue statement belongs to. Trying to fix it, I have some comments. Better than
    my ($server, @data) = (split(",",$line));
    would be
    my ($server, $data) = (split(",",$line));
    or even better would be
    my ($server, $data) = split ',', $line, 2;
    Actually this last addition of the limit argument of 2 doesn't make much difference, since perl already optimizes the plain list assignment and stops without parsing the whole line. Note also that none of the parentheses were needed on the right-hand side.
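A quick illustration of both points (hypothetical data, using one of the server lines from the question):

```perl
use strict;
use warnings;

my $line = 'wsomddvfxa01,39.22,92.19,82.59,88.25';

# With a limit of 2, everything after the first comma stays in one piece.
my ($server, $rest) = split ',', $line, 2;
# $server is 'wsomddvfxa01', $rest is '39.22,92.19,82.59,88.25'

# Without an explicit limit, a plain list assignment still lets perl
# stop early, so only the leading fields actually get parsed out.
my ($name, $avg) = split ',', $line;
# $name is 'wsomddvfxa01', $avg is '39.22'
```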

    Now with a simple scalar data, it is easier to fix the code that you have to skip the first line, so instead of

    if ($data[0] lt "!" ) {
        $data[0] = 0;
    }
    next if grep /[^0-9.]/, @data;
    you could try
    next if $data =~ /[^0-9.]/;
    Furthermore, you did a clever
    push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
    push @{$usage{$server}}, $data[0];
    But perl already autoextends arrays, so you can simply write
    $usage{$server}->[$files] = $data unless $usage{$server}->[$files]
    This will leave all the previous zeros as missing, which you can deal with at the end more easily. BTW, your code will fill in missing zeros if they are followed by data in later files, but if the missing zeros were in the third file, the arrays would not receive them using your code.
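A small demonstration of that sparse-array behaviour (hypothetical values; the server name is taken from the thread's data):

```perl
use strict;
use warnings;

my %usage;
my $files = 2;    # pretend we are on the third input file (index 2)

# Assigning straight to index $files autoextends the array; slots 0 and 1
# are left undef instead of being pushed as explicit zeros.
$usage{wsompapgtw05}[$files] = 30.66 unless $usage{wsompapgtw05}[$files];

# One pass at the end turns the undef gaps into zeros.
for my $server (keys %usage) {
    $usage{$server}[$_] = 0
        for grep { !defined $usage{$server}[$_] } 0 .. $files;
}
```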

    At the end you have

    continue {
        $files++ if eof;
    }
    close $fh or die "Can't close file $file: $!";
    What is the point? Why not a clean
    my $files = 0;
    for my $file ("sfull1ns.dat","sfull2ns.dat","sfull3ns.dat") {
        open (my $fh,'<',$file) or die "Can't open file $file: $!";
        while (my $line = <$fh>) {
            ...
        }
        $files++;
        close $fh or die "Can't close file $file: $!";
    }

    And you have no provision for output, obviously you need a final line after the whole looping is done which does

    print "$_," . (join ',', @{$usage{$_}}) . "\n" for (keys %usage);

    This will of course complain about uninitialized values in the array, but will run. Instead of zeros, the missing values will be missing, but we are nearly there. So now is the time to fix that by putting this code just before the printout.

    for (keys %usage) {
        for my $f (0,1,2) {
            $usage{$_}->[$f] = 0 unless defined $usage{$_}->[$f];
        }
    }
    You may argue that your solution, keeping this inside the main loop, was better; it did take less code. But it was confusing in the middle of other processing. Separating it out here makes, to my mind, cleaner code.

    I have not put it all together for you, so that you can understand each piece as you implement it, but it did work for me.

Re: Eliminating Duplicate Lines From A CSV File
by Cristoforo (Curate) on Jul 25, 2007 at 03:35 UTC
    Taking ideas from GrandFather's code, and some ideas of my own, I have made some changes to your code that should work. I commented out some lines you had and replaced them with new code. The biggest change was postponing the check for a missing average (and the assignment of 0) until the printout part of the code at the end. :-)
    Chris
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %usage;
    my $files = 0;

    for my $file (qw/sfull1ns.dat sfull2ns.dat sfull3ns.dat/) {
        open (my $fh,'<',$file) or die "Can't open file $file: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^Server/;
            #chomp($line);
            #my ($server, @data) = (split(",",$line));

            # Assigns the first 2 values in the CSV
            # and discards the rest
            my ($server, $avg) = (split(",",$line));

            #if ($data[0] lt "!" ) {
            #    $data[0] = 0;
            #}
            #next if grep /[^0-9.]/, @data;
            #$usage{$server} = [] unless exists $usage{$server};
            #push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
            #push @{$usage{$server}}, $data[0];

            # if this is a second occurrence of a server in the file,
            # its avg won't be assigned because the first one is already
            # stored there
            $usage{$server}[$files] ||= $avg;
        }
        continue {
            $files++ if eof;
        }
        close $fh or die "Can't close file $file: $!";
    }

    for my $server (sort keys %usage) {
        # Either this server has an average (for each element of the array)
        # or assign 0 to the ones that don't have a value
        my @avgs = map $usage{$server}[$_] || 0, 0..$#{$usage{$server}};
        print join(",", $server, @avgs), "\n";
    }

      Cristoforo,


      I tested your Perl code. It handles the elimination of rows with duplicate server names in the same file.

      It also handles the replacement of missing values with '0', as long as non-missing values for a server are found in one of the later files.

      In other words, if the server does not have a row in the 1st and 2nd files, but does have a row in the 3rd file, the code will produce

      Server,0,0,12.45

      But if the server has a row in the 1st input file, but not in the 2nd and 3rd, the code will produce

      Server,2.85,,,

      Or, if the server has a row in the 1st and 2nd input files, but not in the 3rd, the code will produce

      Server,2.85,3.56,,

      Do you have any idea why the missing values are not being replaced with '0' when they occur at the end of the array?

        Yes, that was an error: if no records were found in the last file(s), no array entry was created. Also, I see now that GrandFather's solution was about the same as mine; he used a hash of hashes, where I used a hash of arrays.

        Correct line is,
        my @avgs = map $usage{$server}[$_] || 0, 0..$files-1;
        instead of
        my @avgs = map $usage{$server}[$_] || 0, 0..$#{$usage{$server}};

        Update: changed 0..$i-1 to 0..$files-1
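The difference is easy to demonstrate on a sparse entry (hypothetical server name and values): when a server only appears in the first file, `$#{$usage{$server}}` stops at the array's own last index, so the tail zeros are never generated.

```perl
use strict;
use warnings;

my $files = 3;                              # three input files were read
my %usage = ( someserver => [ 2.85 ] );     # server seen only in file 1

# Iterating up to the array's own last index misses the absent files.
my @short = map { $usage{someserver}[$_] || 0 } 0 .. $#{ $usage{someserver} };

# Iterating up to $files - 1 pads every absent file with a 0.
my @full  = map { $usage{someserver}[$_] || 0 } 0 .. $files - 1;

print join(',', 'someserver', @short), "\n";   # someserver,2.85
print join(',', 'someserver', @full),  "\n";   # someserver,2.85,0,0
```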