Aggregating the column based on the common column values

kanskr has asked for the wisdom of the Perl Monks concerning the following question:

My data is like below

Username | Roles | Type |date |SR NO|Remarks|
abc|admin |added | 01072015
abc|developer |deleted |01072015
abc|deploy |added |01072015
xyz |admin |deleted |01072015
xyz| deploy|deleted|01072015
cdf|deploy|added|01072015
[download]

and i need to do the following a) i need to group based on the user, roles and concatenate the roles filed with the delimiter "," b) I need to create the output as below.

Username |roles added |roles deleted |date
username |roles_added |roles deleted |date
abc |admin,deploy |developer |01072015 
xyz ||admin,deploy |01072015
cdf |deploy||01072015
[download]

In order to achieve this,am using the code below

 use strict;
use warnings;
use diagnostics;
use Data::Dumper;

open(FIL,"report.txt") or die("$!");

my %k=();

while (my $line=<FIL>) {
      next if $. < 2;
        my ($user,$roles,$type,$dt,$empty1,$empty2)=split(/\|/,$line);

        push @{$k{$user}{$type}}, $roles;

}

my @names=(sort keys(%k));



foreach ( @names) {
   if ( @{$k{$_}{Added}} ne ''){
   print "${k{user}}\n";
   print   join ',', @{$k{$_}{Added}};
   print "\n";
   }
   if ( @{$k{$_}{Deleted}} ne '') {

   print join ',', @{$k{$_}{Deleted}};
   print "\n";
        }
}
[download]

The script fails with below error Can't use an undefined value as an ARRAY reference at report3.pl line 34, <FIL> line 23. at report3.pl line 34 Please help me how i can generate the output as expected. Many thanks

Comment on Aggregating the column based on the common column values Select or Download Code

Replies are listed 'Best First'.
Re: Aggregating the column based on the common column values by kevbot (Vicar) on Jul 03, 2015 at 05:55 UTC
Hello kanskr, I noticed a few issues with your code, see my edits/additions below. I put your example data in a file called report.txt and tested your script using it. I get the same errors. I printed your hash using Data::Dumper, which made troubleshooting easier. However, it can still be difficult to see problematic extra white space. The input data you provided seems to have random spaces before or after some fields. You need to ensure that your input data is more consistently formatted, or you need to adjust your code to tolerate the extra spaces. Using, `split(/\s\\|\s/, $line)` will remove the spaces from your parsed data. In your foreach loop, you are attempting to use hash keys of Added and Deleted; however, your data (and therefore your hash) contains added and deleted. The case of your string matters. You are getting the Can't use an undefined value as an ARRAY reference error, because the added and/or deleted hash key does not exist for some usernames. So, instead of checking for empty values like this: `@{$k{$_}{added}} ne ''` I suggest that you check that the value exists, like this: `exists($k{$_}{added})` This line appears to contain typos: `print "${k{user}}\n";` so I removed it from the code. This code, use strict; use warnings; use diagnostics; use Data::Dumper; open(FIL,"report.txt") or die("$!"); my %k=(); while (my $line=<FIL>) { next if $. < 2; my ($user,$roles,$type,$dt,$empty1,$empty2)=split(/\s\\|\s/,$ +line); push @{$k{$user}{$type}}, $roles; } my @names = sort keys(%k); # Dumping the reference to the hash to more clearly show its structure + print Dumper \%k; foreach ( @names ) { print "Name: ", $_, "\n"; if( exists($k{$_}{added}) ){ print "\tAdded:\n\t\t"; print join ',', @{$k{$_}{added}}; print "\n"; } if ( exists($k{$_}{deleted}) ) { print "\tDeleted:\n\t\t"; print join ',', @{$k{$_}{deleted}}; print "\n"; } } [download] gives the following output: `$VAR1 = { 'cdf' => { 'added' => [ 'deploy' ] }, 'xyz' => { 'deleted' => [ 'admin', 'deploy' ] }, 'abc' => { 'added' => [ 'admin', 'deploy' ], 'deleted' => [ 'developer' ] } }; Name: abc Added: admin,deploy Deleted: developer Name: cdf Added: deploy Name: xyz Deleted: admin,deploy` [download]	[reply] [d/l] [select]
Re: Aggregating the column based on the common column values by kcott (Archbishop) on Jul 03, 2015 at 13:11 UTC
G'day kanskr, Welcome to the Monastery. Your "`undefined value as an ARRAY reference`" errors come from code like '`@{$k{$_}{Added}}`'. None of your (shown) input data has types of `'Added'` or `'Deleted'`, so those keys won't exist: the types you do have are: `'added'` and `'deleted'`. However, not all users have both of those types so, again, some keys won't exist. You can get around that by changing '`@{$k{$_}{$type}}`' to '`@{$k{$_}{$type} \|\| []}`', see my code below for examples of this. You're using string comparison tests ('`ne`') against arrays (e.g. `@{...} ne ''`). Don't do that! You only need to test like this: '`if (@{...})`'. If '`@{...}`' evaluates to an empty list, it's FALSE; otherwise, it's TRUE: see "perlsyn: Truth and Falsehood". Your input data is a ~~tab~~pipe-separated file (with odd bits of whitespace floating around in it). Unless you're doing this for reasons of your own (education, entertainment, whatever), there's really no value in reinventing this wheel. It's already been coded for you: Text::CSV. I show an example of how you might use this, below. Your example input shows only one date throughout: I suspect that's a poor example and your real data has different dates (but do correct me if I'm wrong). In my example code below, I've given an example of how you might deal with different dates; and I've changed the input so every "Role" action has a different date. What I have there may be unsuitable for you but, as you don't even address the "Date" field in the code you've shown, perhaps it might provide a pointer or two. In the code example below, I've used the filehandles `\DATA` and `\STDOUT`. You're going to need to create filehandles to real files. I recommend you stop using the 2-argument form of open with global, package variables and, instead, use the 3-argument form with lexical filehandles. Text::CSV has example code; open also has example code with substantial discussion. Here's the example code I keep talking about: #!/usr/bin/env perl use strict; use warnings; use Text::CSV; my $psv = Text::CSV::->new({sep_char => '\|', allow_whitespace => 1}) or die 'Text::CSV problem: ', Text::CSV::->error_diag(); my %data_for; while (my $row = $psv->getline(\DATA)) { my ($user, $role, $type, $date) = @$row; push @{$data_for{$user}{$type}}, $role; push @{$data_for{$user}{dates}{$type}}, $date; } $psv->eol("\n"); for (sort keys %data_for) { $psv->print(\STDOUT, [ $_, join(',', @{$data_for{$_}{added} \|\| []}), join(',', @{$data_for{$_}{deleted} \|\| []}), join(',', @{$data_for{$_}{dates}{added} \|\| []}, @{$data_for{$_}{dates}{deleted} \|\| []} ), ]); } __DATA__ abc\|admin \|added \| 01072015 abc\|developer \|deleted \|02072015 abc\|deploy \|added \|03072015 xyz \|admin \|deleted \|04072015 xyz\| deploy\|deleted\|05072015 cdf\|deploy\|added\|06072015 [download] Here's the output that script produces: `abc\|admin,deploy\|developer\|01072015,03072015,02072015 cdf\|deploy\|\|06072015 xyz\|\|admin,deploy\|04072015,05072015` [download] Update: I originally wrote that the input was tab-separated; no idea why; obviously it's pipe (`\|`) separated. I've corrected that within the text. I've also changed all (4) instances of `$tsv` to `$psv` and retested the code: output remains the same. Apologies if that caused any confusion. Changes were made 23 minutes after original posting. -- Ken	[reply] [d/l] [select]