Perl Sum & Count column based data

ler224 has asked for the wisdom of the Perl Monks concerning the following question:

I am having difficulty manipulating column based data using Perl in an AWK/Excel like way. There are some posts on stackoverflow that are similar that I am looking to do but not exactly. This is the closest example: http://stackoverflow.com/a/9179682/2379629 I am trying to sum and count multiple (many & variable) columns in Perl based on unique values in a "base" column. For example if this is the data set, I may want to use either the first column or second column as the "base" column. I then want to count & sum column's (fields) 3 & 4 for each value in the "Base" column. But I dont only want to do this for columns 3 & 4. I want the column numbers to be variable based on my input data set & my defined columns. here is an example:

__DATA__
U1     ID1     100     280
U1     ID1     137     250
U2     ID2     150     375
U1     ID2     100     100
U3     ID1     100     600
U9     ID3     137     200
[download]

example output: (Field[0])
U1     3     337     3     630     
U2     1     150     1     375
U3     1     100     1     600
U9     1     137     1     200
[download]

example output: (Field[1])
ID1     3     300     3     1130
ID2     2     250     2     475
ID3     1     137     1     200
[download]

I could do that by modifying: $fields[0] to 0 or 1. But I cannot figure out how to dynamically set the number of columns. I believe I may need something like this below, where I can loop for each column. And chose the columns (fields) that I want to use in an @column = (1,2,10,20).

$res{$fields[0]}{$column}{instance}++;
$res{$fields[0]}{$column}{volume} += $fields[$column];
[download]

I want to avoid the below, because then I need to write 2 lines of code for each column that I want to manipulate and it would make it harder to change the columns that I want.

$res{$fields[0]}{instance}++;
$res{$fields[0]}{volume} += $fields[1];
$res{$fields[0]}{instance2}= (<insert code to count field[2]>)
$res{$fields[0]}{volume2} += $fields[2];
[download]

Comment on Perl Sum & Count column based data Select or Download Code

Replies are listed 'Best First'.
Re: Perl Sum & Count column based data by NetWallah (Canon) on Nov 15, 2013 at 04:05 UTC
More scalable (Less memory), but with a less pretty output than idg's solution (Also closer to the specification - base column is a parameter): use strict; use warnings; my $basecol=shift @ARGV or die "WHat base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD}[$col]+= $f[$col]; } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 [download] When in doubt, mumble; when in trouble, delegate; when in charge, ponder. -- James H. Boren	[reply] [d/l]
Re^2: Perl Sum & Count column based data by Anonymous Monk on Nov 21, 2013 at 18:22 UTC
I have been using some of this syntax. I was trying to also output a unique count of the numbers in the 3rd & 4th columns. I was trying to push the value's in the column onto an array, if it already is not already in the array. I think this is currently a scalar value but i would need an array/list. This is the error that I receive: Not an ARRAY reference at ./test.pl line 21, <DATA> line 1. I am thinking once I have the values in an array or list I can just print the list in scalar context to get the unique count? I don't know if there is a simpler way to achieve this. #!/usr/bin/perl use strict; use warnings; my $basecol=shift @ARGV or die "What base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; splice @f,0,0,$f[0] . "_". $f[1]; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD_S}[$col]+= $f[$col]; push($h{$k}{FIELD_C}[$col],$f[$col]) if !($f[$col] ~~ $h{$k}{FIELD +_C}[$col]); } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD_S} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 [download]	[reply] [d/l]
Re: Perl Sum & Count column based data by Kenosis (Priest) on Nov 15, 2013 at 04:17 UTC
Perhaps the following will be helpful: use strict; use warnings; my $base = 0; my @cols = qw/2 3/; my %hash; while ( my $line = <DATA> ) { my @vals = split ' ', $line; $hash{ $vals[$base] }[$_] += $vals[ $cols[$_] ] for 0 .. $#cols; $hash{ $vals[$base] }[ $#cols + 1 ]++; } for my $key ( sort keys %hash ) { print "$key"; print "\t$hash{$key}[ $#cols + 1 ]\t$hash{$key}[$_]" for 0 .. $#co +ls; print "\n"; } __DATA__ U1 ID1 100 280 30 350 U1 ID1 137 250 70 200 U2 ID2 150 375 400 50 U1 ID2 100 100 100 375 U3 ID1 100 600 50 100 U9 ID3 137 200 75 150 [download] Output when `$col = 0; @cols = qw/2 3/`: `U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200` [download] Output when `$col = 1; @cols = qw/2 4 5/`: `ID1 3 337 3 150 3 650 ID2 2 250 2 500 2 425 ID3 1 137 1 75 1 150` [download] The `while` loop builds a data structure like the following, which is then later `print`ed in the `for` loop: `$VAR1 = { 'U1' => [ 337, 630, 3 ], 'U2' => [ 150, 375, 1 ], 'U9' => [ 137, 200, 1 ], 'U3' => [ 100, 600, 1 ] };` [download] Edit: Updated to more closely meet the OP's scalability specs.	[reply] [d/l] [select]
Re: Perl Sum & Count column based data by aaron_baugher (Curate) on Nov 15, 2013 at 04:34 UTC
This will let you specify a "Base" column as the first argument on the command line, followed by any number of other columns to count and sum (the first column is numbered zero, as in your example): abaugher@cail> cat 1062685.pl #!/usr/bin/env perl use strict; use warnings; my $key = shift; my @cols = @ARGV; my %h; die "Usage: 1062685.pl id_column field1 [field2]...\n" unless @cols; # $ARGV[0] is the Base column # $ARGV[1..x] is the list of columns to add up while(<DATA>){ chomp; my @f = split; for (@cols){ $h{$f[$key]}{$_}{t} += $f[$_]; $h{$f[$key]}{$_}{n}++; } } for my $k (sort keys %h){ print $k; print "\t$h{$k}{$_}{n}\t$h{$k}{$_}{t}" for @cols; print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 abaugher@cail> perl 1062685.pl 1 2 3 ID1 3 337 3 1130 ID2 2 250 2 475 ID3 1 137 1 200 abaugher@cail> perl 1062685.pl 0 2 3 U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200 [download] Aaron B. Available for small or large Perl jobs and *nix system administration; see my home node.	[reply] [d/l]
Re: Perl Sum & Count column based data by hdb (Monsignor) on Nov 15, 2013 at 07:12 UTC
Have a look at Data::Table and see whether it can do what you want: use strict; use warnings; use Data::Table; use List::Util qw/sum/; my $data = [ map { chomp; [ split /\s+/ ] } <DATA> ]; # read data i +nto array of arrays my $header = [ map { "col$_" } 0..@{$data->[0]}-1 ]; # create some + headers for the data my $t = new Data::Table( $data, $header, 0 ); my $summary = $t->group( [ "col0" ], + # base column [ "col2", "col2", "col3", "col3" ], + # fields [ sub{scalar @_}, sub{sum @_}, sub{scalar @_} +, sub{sum @_}, ], # aggregation functions [ "Count2", "Sum2", "Count3", "Sum3" ], + # names of result columns 0 ); + # do not display other columns print $summary->tsv; # print results in tabular form __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 [download] Update: Added suppression of other columns	[reply] [d/l]
Re: Perl Sum & Count column based data by ig (Vicar) on Nov 15, 2013 at 03:54 UTC
You say you are having difficulty, but you don't say what the difficulty is very clearly, making it difficult to help. Perhaps the following will give you some ideas. Otherwise, you might be a bit more specific about what the difficulty is. #!C:/strawberry/perl/bin/perl.exe # # See http://rick.measham.id.au/pdf-api2/ # use strict; use warnings; use Data::Dumper::Concise; my @records; while(my $record = <DATA>) { push(@records, [ split(/\s+/, $record) ]); } report({ group_by => 0, sum_columns => [ 2, 3 ], records => \@records, }); report({ group_by => 1, sum_columns => [ 2, 3 ], records => \@records, }); report({ group_by => 2, sum_columns => [ 3 ], records => \@records, }); exit(0); sub report { my ($args) = @_; # sum columns by group my %groups; foreach my $record (@{$args->{records}}) { my $group = $groups{$record->[$args->{group_by}]} //= {}; $group->{count}++; foreach my $sum_column (@{$args->{sum_columns}}) { $group->{sum}->{$sum_column} += $record->[$sum_column]; } } # produce output print "example output: group by column $args->{group_by}, sum colu +mns " . join(', ', @{$args->{sum_columns}}) . "\n"; foreach my $group (sort keys %groups) { print join(' ', $group, map { ( $groups{$group}->{count}, $groups{$group}->{sum}->{$_} ) } @{$args->{sum_columns}} ) . "\n";; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 [download] Which produces: `example output: group by column 0, sum columns 2, 3 U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200 example output: group by column 1, sum columns 2, 3 ID1 3 337 3 1130 ID2 2 250 2 475 ID3 1 137 1 200 example output: group by column 2, sum columns 3 100 3 980 137 2 450 150 1 375` [download]	[reply] [d/l] [select]
Re: Perl Sum & Count column based data by CountZero (Bishop) on Nov 15, 2013 at 07:08 UTC
Dump the data into a database and use standard SQL queries to get your sums and counts. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply]
Re: Perl Sum & Count column based data by Anonymous Monk on Nov 15, 2013 at 03:30 UTC
Can you post runnable code?	[reply]
Re: Perl Sum & Count column based data by ler224 (Acolyte) on Nov 15, 2013 at 06:03 UTC
All of these responses look like they will work. Thank you all. I will run them all through perl to understand the syntax better.	[reply]
Re: Perl Sum & Count column based data by ler224 (Acolyte) on Nov 21, 2013 at 19:16 UTC
I have been using some of this syntax. I was trying to also output a unique count of the numbers in the 3rd & 4th columns. I was trying to push the value's in the column onto an array, if it already is not already in the array. I think this is currently a scalar value but i would need an array/list. This is the error that I receive: Not an ARRAY reference at ./test.pl line 21, <DATA> line 1. I am thinking once I have the values in an array or list I can just print the list in scalar context to get the unique count? I don't know if there is a simpler way to achieve this. #!/usr/bin/perl use strict; use warnings; my $basecol=shift @ARGV or die "What base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; splice @f,0,0,$f[0] . "_". $f[1]; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD_S}[$col]+= $f[$col]; push($h{$k}{FIELD_C}[$col],$f[$col]) if !($f[$col] ~~ $h{$k}{FIELD +_C}[$col]); } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD_S} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 [download]	[reply] [d/l]


Don't ask to ask, just ask
	PerlMonks