Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Perl Sum & Count column based data

by ler224 (Acolyte)
on Nov 15, 2013 at 02:53 UTC ( [id://1062685]=perlquestion: print w/replies, xml ) Need Help??

ler224 has asked for the wisdom of the Perl Monks concerning the following question:

I am having difficulty manipulating column based data using Perl in an AWK/Excel like way. There are some posts on stackoverflow that are similar that I am looking to do but not exactly. This is the closest example: http://stackoverflow.com/a/9179682/2379629 I am trying to sum and count multiple (many & variable) columns in Perl based on unique values in a "base" column. For example if this is the data set, I may want to use either the first column or second column as the "base" column. I then want to count & sum column's (fields) 3 & 4 for each value in the "Base" column. But I dont only want to do this for columns 3 & 4. I want the column numbers to be variable based on my input data set & my defined columns. here is an example:
__DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200
example output: (Field[0]) U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200
or
example output: (Field[1]) ID1 3 300 3 1130 ID2 2 250 2 475 ID3 1 137 1 200
I could do that by modifying: $fields[0] to 0 or 1. But I cannot figure out how to dynamically set the number of columns. I believe I may need something like this below, where I can loop for each column. And chose the columns (fields) that I want to use in an @column = (1,2,10,20).
$res{$fields[0]}{$column}{instance}++; $res{$fields[0]}{$column}{volume} += $fields[$column];
I want to avoid the below, because then I need to write 2 lines of code for each column that I want to manipulate and it would make it harder to change the columns that I want.
$res{$fields[0]}{instance}++; $res{$fields[0]}{volume} += $fields[1]; $res{$fields[0]}{instance2}= (<insert code to count field[2]>) $res{$fields[0]}{volume2} += $fields[2];

Replies are listed 'Best First'.
Re: Perl Sum & Count column based data
by NetWallah (Canon) on Nov 15, 2013 at 04:05 UTC
    More scalable (Less memory), but with a less pretty output than idg's solution (Also closer to the specification - base column is a parameter):
    use strict; use warnings; my $basecol=shift @ARGV or die "WHat base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD}[$col]+= $f[$col]; } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200

                 When in doubt, mumble; when in trouble, delegate; when in charge, ponder. -- James H. Boren

      I have been using some of this syntax. I was trying to also output a unique count of the numbers in the 3rd & 4th columns. I was trying to push the value's in the column onto an array, if it already is not already in the array. I think this is currently a scalar value but i would need an array/list. This is the error that I receive: Not an ARRAY reference at ./test.pl line 21, <DATA> line 1. I am thinking once I have the values in an array or list I can just print the list in scalar context to get the unique count? I don't know if there is a simpler way to achieve this.
      #!/usr/bin/perl use strict; use warnings; my $basecol=shift @ARGV or die "What base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; splice @f,0,0,$f[0] . "_". $f[1]; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD_S}[$col]+= $f[$col]; push($h{$k}{FIELD_C}[$col],$f[$col]) if !($f[$col] ~~ $h{$k}{FIELD +_C}[$col]); } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD_S} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200
Re: Perl Sum & Count column based data
by Kenosis (Priest) on Nov 15, 2013 at 04:17 UTC

    Perhaps the following will be helpful:

    use strict; use warnings; my $base = 0; my @cols = qw/2 3/; my %hash; while ( my $line = <DATA> ) { my @vals = split ' ', $line; $hash{ $vals[$base] }[$_] += $vals[ $cols[$_] ] for 0 .. $#cols; $hash{ $vals[$base] }[ $#cols + 1 ]++; } for my $key ( sort keys %hash ) { print "$key"; print "\t$hash{$key}[ $#cols + 1 ]\t$hash{$key}[$_]" for 0 .. $#co +ls; print "\n"; } __DATA__ U1 ID1 100 280 30 350 U1 ID1 137 250 70 200 U2 ID2 150 375 400 50 U1 ID2 100 100 100 375 U3 ID1 100 600 50 100 U9 ID3 137 200 75 150

    Output when $col = 0; @cols = qw/2 3/:

    U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200

    Output when $col = 1; @cols = qw/2 4 5/:

    ID1 3 337 3 150 3 650 ID2 2 250 2 500 2 425 ID3 1 137 1 75 1 150

    The while loop builds a data structure like the following, which is then later printed in the for loop:

    $VAR1 = { 'U1' => [ 337, 630, 3 ], 'U2' => [ 150, 375, 1 ], 'U9' => [ 137, 200, 1 ], 'U3' => [ 100, 600, 1 ] };

    Edit: Updated to more closely meet the OP's scalability specs.

Re: Perl Sum & Count column based data
by aaron_baugher (Curate) on Nov 15, 2013 at 04:34 UTC

    This will let you specify a "Base" column as the first argument on the command line, followed by any number of other columns to count and sum (the first column is numbered zero, as in your example):

    abaugher@cail> cat 1062685.pl #!/usr/bin/env perl use strict; use warnings; my $key = shift; my @cols = @ARGV; my %h; die "Usage: 1062685.pl id_column field1 [field2]...\n" unless @cols; # $ARGV[0] is the Base column # $ARGV[1..x] is the list of columns to add up while(<DATA>){ chomp; my @f = split; for (@cols){ $h{$f[$key]}{$_}{t} += $f[$_]; $h{$f[$key]}{$_}{n}++; } } for my $k (sort keys %h){ print $k; print "\t$h{$k}{$_}{n}\t$h{$k}{$_}{t}" for @cols; print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200 abaugher@cail> perl 1062685.pl 1 2 3 ID1 3 337 3 1130 ID2 2 250 2 475 ID3 1 137 1 200 abaugher@cail> perl 1062685.pl 0 2 3 U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

Re: Perl Sum & Count column based data
by hdb (Monsignor) on Nov 15, 2013 at 07:12 UTC

    Have a look at Data::Table and see whether it can do what you want:

    use strict; use warnings; use Data::Table; use List::Util qw/sum/; my $data = [ map { chomp; [ split /\s+/ ] } <DATA> ]; # read data i +nto array of arrays my $header = [ map { "col$_" } 0..@{$data->[0]}-1 ]; # create some + headers for the data my $t = new Data::Table( $data, $header, 0 ); my $summary = $t->group( [ "col0" ], + # base column [ "col2", "col2", "col3", "col3" ], + # fields [ sub{scalar @_}, sub{sum @_}, sub{scalar @_} +, sub{sum @_}, ], # aggregation functions [ "Count2", "Sum2", "Count3", "Sum3" ], + # names of result columns 0 ); + # do not display other columns print $summary->tsv; # print results in tabular form __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200

    Update: Added suppression of other columns

Re: Perl Sum & Count column based data
by ig (Vicar) on Nov 15, 2013 at 03:54 UTC

    You say you are having difficulty, but you don't say what the difficulty is very clearly, making it difficult to help.

    Perhaps the following will give you some ideas. Otherwise, you might be a bit more specific about what the difficulty is.

    #!C:/strawberry/perl/bin/perl.exe # # See http://rick.measham.id.au/pdf-api2/ # use strict; use warnings; use Data::Dumper::Concise; my @records; while(my $record = <DATA>) { push(@records, [ split(/\s+/, $record) ]); } report({ group_by => 0, sum_columns => [ 2, 3 ], records => \@records, }); report({ group_by => 1, sum_columns => [ 2, 3 ], records => \@records, }); report({ group_by => 2, sum_columns => [ 3 ], records => \@records, }); exit(0); sub report { my ($args) = @_; # sum columns by group my %groups; foreach my $record (@{$args->{records}}) { my $group = $groups{$record->[$args->{group_by}]} //= {}; $group->{count}++; foreach my $sum_column (@{$args->{sum_columns}}) { $group->{sum}->{$sum_column} += $record->[$sum_column]; } } # produce output print "example output: group by column $args->{group_by}, sum colu +mns " . join(', ', @{$args->{sum_columns}}) . "\n"; foreach my $group (sort keys %groups) { print join(' ', $group, map { ( $groups{$group}->{count}, $groups{$group}->{sum}->{$_} ) } @{$args->{sum_columns}} ) . "\n";; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200

    Which produces:

    example output: group by column 0, sum columns 2, 3 U1 3 337 3 630 U2 1 150 1 375 U3 1 100 1 600 U9 1 137 1 200 example output: group by column 1, sum columns 2, 3 ID1 3 337 3 1130 ID2 2 250 2 475 ID3 1 137 1 200 example output: group by column 2, sum columns 3 100 3 980 137 2 450 150 1 375
Re: Perl Sum & Count column based data
by CountZero (Bishop) on Nov 15, 2013 at 07:08 UTC
    Dump the data into a database and use standard SQL queries to get your sums and counts.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Perl Sum & Count column based data
by Anonymous Monk on Nov 15, 2013 at 03:30 UTC
    Can you post runnable code?
Re: Perl Sum & Count column based data
by ler224 (Acolyte) on Nov 15, 2013 at 06:03 UTC
    All of these responses look like they will work. Thank you all. I will run them all through perl to understand the syntax better.
Re: Perl Sum & Count column based data
by ler224 (Acolyte) on Nov 21, 2013 at 19:16 UTC
    I have been using some of this syntax. I was trying to also output a unique count of the numbers in the 3rd & 4th columns. I was trying to push the value's in the column onto an array, if it already is not already in the array. I think this is currently a scalar value but i would need an array/list. This is the error that I receive: Not an ARRAY reference at ./test.pl line 21, <DATA> line 1. I am thinking once I have the values in an array or list I can just print the list in scalar context to get the unique count? I don't know if there is a simpler way to achieve this.
    #!/usr/bin/perl use strict; use warnings; my $basecol=shift @ARGV or die "What base col?"; my %h; $basecol--; # zero based inside program. while (<DATA>){ chomp; my @f= split ; next unless @f > 1; splice @f,0,0,$f[0] . "_". $f[1]; my $k = $f[$basecol]; $h{$k}{COUNT}++; for my $col(0..$#f){ next if $col==$basecol; $f[$col]=0 unless $f[$col]=~/^\d+$/; $h{$k}{FIELD_S}[$col]+= $f[$col]; push($h{$k}{FIELD_C}[$col],$f[$col]) if !($f[$col] ~~ $h{$k}{FIELD +_C}[$col]); } } for (sort keys %h){ print "$_\t $h{$_}{COUNT}\t"; for my $f(@{ $h{$_}{FIELD_S} }){ defined $f or $f=''; print "$f\t"; } print "\n"; } __DATA__ U1 ID1 100 280 U1 ID1 137 250 U2 ID2 150 375 U1 ID2 100 100 U3 ID1 100 600 U9 ID3 137 200

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1062685]
Approved by ig
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2024-04-18 16:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found