country1 has asked for the wisdom of the Perl Monks concerning the following question:

I have 3 CSV files which I am using as input. They do not necessarily contain identical server names, nor are they necessarily in the same sort order. Also, a particular input file may have more than one line with the same server name (see server wsomddvfxa01 in File A).
File A
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,93.75,95.87,66.67,68.13
wsomdavpra03,90.39,94,65.77,68.51
wsomddvfxa01,39.22,92.19,82.59,88.25
wsomddvfxa01,35.45,89.23,79.89,83.24

File B
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,34.78,100,55.1,67.6
wsomdavpra03,69.04,98.55,84.07,89.73
wsomddvfxa01,92.44,97.54,67.72,71.69
wsompapgtw05,48.77,96.9,92.1,93.55

File C
Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
WSOMQAVPRA05,93.13,98.11,68.95,73.47
wsomdavpra03,68.85,97.56,76.35,98.23
wsomddvfxa01,46.97,96.29,88.23,94.02
wsompapgtw05,30.66,93.74,39.89,71.35

What I am trying to do is this: for each server (in column 1), get the Avg CPU (in column 2) from each of the 3 files. For the files above this would produce

File OUT
WSOMQAVPRA05,93.75,34.78,93.13
wsomdavpra03,90.39,69.04,68.85
wsomddvfxa01,39.22,92.44,46.97
wsompapgtw05,0,48.77,30.66

Notice that in File OUT, since server wsompapgtw05 does not appear in File A, its value is replaced with '0'. I was able to do this in my Perl code (i.e., place a '0' in the output file when a particular server name appears in at least one of the input files, but not in all of them).

I have not yet been able to accomplish the second desired task. Notice in File OUT that the values for server wsomddvfxa01 are taken from the first line in File A. I want my Perl code to pick out the Avg CPU value from the first data line when there is more than one line for a given server name in one of the input CSV files.
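For reference, the whole merge can be sketched in a few lines (a minimal sketch, not the poster's code; `merge_avg_cpu` is a made-up name, and it takes arrays of lines rather than file handles so it stays self-contained):

```perl
use strict;
use warnings;

# Merge the "Avg CPU" column (field 2) from several CSV inputs, keyed by
# server name (field 1). The first data line seen for a server in a given
# file wins; servers missing from a file get 0.
sub merge_avg_cpu {
    my @files = @_;    # each element: arrayref of CSV lines for one file
    my %usage;         # server name => [avg CPU per file]
    for my $i (0 .. $#files) {
        for my $line (@{ $files[$i] }) {
            my ($server, $avg) = split /,/, $line, 3;
            next unless defined $avg;      # ignore malformed lines
            next if $avg =~ /[^0-9.]/;     # skips the header line
            $usage{$server}[$i] //= $avg;  # only the first value sticks
        }
    }
    # pad the gaps with 0 so every server has one value per file
    for my $server (keys %usage) {
        $usage{$server}[$_] //= 0 for 0 .. $#files;
    }
    return \%usage;
}
```

The defined-or assignment (`//=`) needs perl 5.10 or later; `||=` behaves the same here as long as a genuine average of 0 never has to be preserved.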

My perl code looks as follows:

#!/usr/bin/perl
use strict;
use warnings;

################################################
#
# Create File with Avg CPU Numbers for each Server
# Name
#
###############################################

my %usage;
my $files = 0;

for my $file ("sfull1ns.dat", "sfull2ns.dat", "sfull3ns.dat") {
    open (my $fh, "<", $file) or die "Can't open file $file: $!";
    while (my $line = <$fh>) {
        chomp($line);
        my ($server, @data) = (split(",", $line));
        if ($data[0] lt "!") {
            $data[0] = 0;
        }
        next if grep /[^0-9.]/, @data;
        $usage{$server} = [] unless exists $usage{$server};
        push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
        push @{$usage{$server}}, $data[0];
    }
    continue {
        $files++ if eof;
    }
    close $fh or die "Can't close file $file: $!";

Code tags added by GrandFather

Replies are listed 'Best First'.
Re: Eliminating Duplicate Lines From A CSV File
by GrandFather (Saint) on Jul 24, 2007 at 21:58 UTC
    use strict;
    use warnings;

    my %usage;

    my @sfull1ns = split /\n/, <<'FILE';
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,93.75,95.87,66.67,68.13
    wsomdavpra03,90.39,94,65.77,68.51
    wsomddvfxa01,39.22,92.19,82.59,88.25
    wsomddvfxa01,35.45,89.23,79.89,83.24
    FILE
    my @sfull2ns = split /\n/, <<FILE;
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,34.78,100,55.1,67.6
    wsomdavpra03,69.04,98.55,84.07,89.73
    wsomddvfxa01,92.44,97.54,67.72,71.69
    wsompapgtw05,48.77,96.9,92.1,93.55
    FILE
    my @sfull3ns = split /\n/, <<FILE;
    Server,Avg CPU,P95 CPU,Avg Mem Util,P95 Mem Util
    WSOMQAVPRA05,93.13,98.11,68.95,73.47
    wsomdavpra03,68.85,97.56,76.35,98.23
    wsomddvfxa01,46.97,96.29,88.23,94.02
    wsompapgtw05,30.66,93.74,39.89,71.35
    FILE

    for my $fileData (['File A', \@sfull1ns], ['File B', \@sfull2ns], ['File C', \@sfull3ns],) {
        my ($filename, $data) = @$fileData;
        shift @$data;
        for my $line (@$data) {
            chomp($line);
            my ( $server, @data ) = ( split( ",", $line ) );
            $usage{$server}{$filename}{value} ||= $data[0];
        }
    }

    for my $file ('File A', 'File B', 'File C') {
        $usage{$_}{$file}{value} ||= 0 for keys %usage;
    }

    for my $server (sort keys %usage) {
        print "$server,",
            join (',', map {$usage{$server}{$_}{value}} sort keys %{$usage{$server}}),
            "\n";
    }

    Prints:

    WSOMQAVPRA05,93.75,34.78,93.13
    wsomdavpra03,90.39,69.04,68.85
    wsomddvfxa01,39.22,92.44,46.97
    wsompapgtw05,0,48.77,30.66

    Reverting to file-based code should be fairly obvious. Keeping the lines in arrays makes the code stand alone.

    Update: Remove averaging code for multiple file entries for a server and use first entry value instead.


    DWIM is Perl's answer to Gödel
      You actually computed the average in case there were multiple averages in a file, like I was tempted to do. But if you read carefully, the original poster wanted the first number in case there were multiple entries for the same server in one file.

      GrandFather,


      The CSV files I am using for input are actually much larger (1600+ records each). How can I modify your code to read the 3 input files ("sfull1ns.dat", "sfull2ns.dat", and "sfull3ns.dat") from disk rather than in-stream?

        Is this what you want?

        use strict; use warnings;
        my %usage;
        my @files = qw(sfull1ns.dat sfull2ns.dat sfull3ns.dat);

        for my $file ( @files ) {
            open( my $fh, "<", $file ) or die "Can't open file $file: $!";
            <$fh>;    # Skip header line
            while ( my $line = <$fh> ) {
                chomp($line);
                my ( $server, @data ) = ( split( ",", $line ) );
                $usage{$server}{$file}{value} ||= $data[0];
            }
        }

        for my $file (@files) {
            $usage{$_}{$file}{value} ||= 0 for keys %usage;
        }

        for my $server (sort keys %usage) {
            print "$server,",
                join (',', map {$usage{$server}{$_}{value}} sort keys %{$usage{$server}}),
                "\n";
        }

        DWIM is Perl's answer to Gödel
Re: Eliminating Duplicate Lines From A CSV File
by b4swine (Pilgrim) on Jul 24, 2007 at 22:49 UTC
    The code as you have written it does not have balanced { braces }, so I can't see how far you got. Nor is it indented in a way that makes clear which loop the continue statement belongs to. Trying to fix it, I have some comments. Better than
    my ($server, @data) = (split(",",$line));
    would be
    my ($server, $data) = (split(",",$line));
    or even better would be
    my ($server, $data) = split ',', $line, 2;
    Actually this last addition of the limit argument of 2 doesn't make much difference, since perl already optimizes the plain list assignment and stops without parsing the whole line. Note also that none of the parentheses were needed on the right-hand side.
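A quick illustration of both points (hypothetical data, using one of the server lines from the question):

```perl
use strict;
use warnings;

my $line = 'wsomddvfxa01,39.22,92.19,82.59,88.25';

# With a limit of 2, everything after the first comma stays in one piece.
my ($server, $rest) = split ',', $line, 2;
# $server is 'wsomddvfxa01', $rest is '39.22,92.19,82.59,88.25'

# Without an explicit limit, a plain list assignment still lets perl
# stop early, so only the leading fields actually get parsed out.
my ($name, $avg) = split ',', $line;
# $name is 'wsomddvfxa01', $avg is '39.22'
```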

    Now with a simple scalar data, it is easier to fix the code that you have to skip the first line, so instead of

    if ($data[0] lt "!" ) {
        $data[0] = 0;
    }
    next if grep /[^0-9.]/, @data;
    you could try
    next if $data =~ /[^0-9.]/;
    Furthermore, you did a clever
    push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
    push @{$usage{$server}}, $data[0];
    But perl already autoextends arrays, so you can simply write
    $usage{$server}->[$files] = $data unless $usage{$server}->[$files]
    This will leave all the previous zeros as missing, which you can deal with at the end more easily. BTW, your code will fill in missing zeros if they are followed by data in later files, but if the missing zeros were in the third file, the arrays would not receive them using your code.
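A small demonstration of that sparse-array behaviour (hypothetical values; the server name is taken from the thread's data):

```perl
use strict;
use warnings;

my %usage;
my $files = 2;    # pretend we are on the third input file (index 2)

# Assigning straight to index $files autoextends the array; slots 0 and 1
# are left undef instead of being pushed as explicit zeros.
$usage{wsompapgtw05}[$files] = 30.66 unless $usage{wsompapgtw05}[$files];

# One pass at the end turns the undef gaps into zeros.
for my $server (keys %usage) {
    $usage{$server}[$_] = 0
        for grep { !defined $usage{$server}[$_] } 0 .. $files;
}
```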

    At the end you have

    continue {
        $files++ if eof;
    }
    close $fh or die "Can't close file $file: $!";
    What is the point? Why not a clean
    my $files = 0;
    for my $file ("sfull1ns.dat","sfull2ns.dat","sfull3ns.dat") {
        open (my $fh,'<',$file) or die "Can't open file $file: $!";
        while (my $line = <$fh>) {
            ...
        }
        $files++;
        close $fh or die "Can't close file $file: $!";
    }

    And you have no provision for output, obviously you need a final line after the whole looping is done which does

    print "$_," . (join ',', @{$usage{$_}}) . "\n" for (keys %usage);

    This will of course complain about uninitialized values in the array, but will run. Instead of zeros, the missing values will be missing, but we are nearly there. So now is the time to fix that by putting this code just before the printout.

    for (keys %usage) {
        for my $f (0,1,2) {
            $usage{$_}->[$f] = 0 unless defined $usage{$_}->[$f];
        }
    }
    You may argue that your solution, keeping this inside the main loop, was better; it did take less code. But it was confusing in the middle of other processing. Separating it out here makes, to my mind, cleaner code.

    I have not put it all together for you, so that you can understand each piece as you implement it, but it did work for me.

Re: Eliminating Duplicate Lines From A CSV File
by Cristoforo (Curate) on Jul 25, 2007 at 03:35 UTC
    Taking ideas from GrandFather's code, and some ideas of my own, I have made some changes to your code that should work. I commented out some lines you had and replaced them with new code. The biggest change was postponing the check for a missing average (and the assignment of 0) until the printout part of the code at the end. :-)
    Chris
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %usage;
    my $files = 0;

    for my $file (qw/sfull1ns.dat sfull2ns.dat sfull3ns.dat/) {
        open (my $fh,'<',$file) or die "Can't open file $file: $!";
        while (my $line = <$fh>) {
            next if $line =~ /^Server/;
            #chomp($line);
            #my ($server, @data) = (split(",",$line));

            # Assigns the first 2 values in the CSV
            # and discards the rest
            my ($server, $avg) = (split(",",$line));

            #if ($data[0] lt "!" ) {
            #    $data[0] = 0;
            #}
            #next if grep /[^0-9.]/, @data;
            #$usage{$server} = [] unless exists $usage{$server};
            #push @{$usage{$server}}, 0 while @{$usage{$server}} < $files;
            #push @{$usage{$server}}, $data[0];

            # if this is a second occurrence of a server in the file,
            # its avg won't be assigned because the first one is already
            # stored there
            $usage{$server}[$files] ||= $avg;
        }
        continue {
            $files++ if eof;
        }
        close $fh or die "Can't close file $file: $!";
    }

    for my $server (sort keys %usage) {
        # Either this server has an average (for each element of the array)
        # or assign 0 to the ones that don't have a value
        my @avgs = map $usage{$server}[$_] || 0, 0..$#{$usage{$server}};
        print join(",", $server, @avgs), "\n";
    }

      Cristoforo,


      I tested your Perl code. It handles the elimination of rows with duplicate server names in the same file.

      It also handles the replacement of missing values with '0', as long as non-missing values for a server are found in one of the later files.

      In other words, if the server does not have a row in the 1st and 2nd files, but does have a row in the 3rd file, the code will produce

      Server,0,0,12.45

      But if the server has a row in the 1st input file, but not in the 2nd and 3rd, the code will produce

      Server,2.85,,,

      Or, if the server has a row in the 1st and 2nd input files, but not in the 3rd, the code will produce

      Server,2.85,3.56,,

      Do you have any idea why the missing values are not being replaced with '0' when they occur at the end of the array?

        Yes, that was an error: if no records were found in the last file(s), no array entry was created. Also, I see now that GrandFather's solution was about the same as mine; he used a hash of hashes, where I used a hash of arrays.

        Correct line is,
        my @avgs = map $usage{$server}[$_] || 0, 0..$files-1;
        instead of
        my @avgs = map $usage{$server}[$_] || 0, 0..$#{$usage{$server}};

        Update: changed 0..$i-1 to 0..$files-1
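The difference is easy to demonstrate on a sparse entry (hypothetical server name and values): when a server only appears in the first file, `$#{$usage{$server}}` stops at the array's own last index, so the tail zeros are never generated.

```perl
use strict;
use warnings;

my $files = 3;                              # three input files were read
my %usage = ( someserver => [ 2.85 ] );     # server seen only in file 1

# Iterating up to the array's own last index misses the absent files.
my @short = map { $usage{someserver}[$_] || 0 } 0 .. $#{ $usage{someserver} };

# Iterating up to $files - 1 pads every absent file with a 0.
my @full  = map { $usage{someserver}[$_] || 0 } 0 .. $files - 1;

print join(',', 'someserver', @short), "\n";   # someserver,2.85
print join(',', 'someserver', @full),  "\n";   # someserver,2.85,0,0
```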