Best way to store/sum multiple-field records?

bobdabuilda has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, quite rusty on my Perl as I've been stuck doing other stuff for quite some time. Am currently trying to help a colleague sort out an issue he has with a small Perl script he's inherited.

Basically, the script reads in some data in the form of:

USERID1|2215|Jones,Tom|
USERID1|1000|Jones, Tom|
USERID3|1495|Dole, Bob|
USERID2|2500|Francis, Pope|
USERID2|1500|Francis, Pope|
[download]

The goal is to process each of these records, keep a running total of the values in the second field, then output the ID, the sum of the second field for that ID, and the name.

I managed to fix a couple of minor things that were wrong in it (no trailing pipe on the end of the record causing additional blank lines in the output, small things like that), but am not sure on the best way to handle this with the additional field. The (broken) code he currently has is:

#!/usr/bin/perl
use strict;
use warnings;

my ($key,$value,$reason);
my %sum=();
while(<>)
    {
    unless(/.*\|(\d+)/){print STDERR "dropped line: \"$_\""; next;}
    ($key,$value,$reason) = split(/\|/);
    $sum{$key}+=$value;
    }

foreach $key (keys %sum){print "$key|$sum{$key}|$reason\n";}
[download]

Running the above gives the following result:

USERID3|1495|Francis, Pope
USERID1|3215|Francis, Pope
USERID2|4000|Francis, Pope
[download]

The obvious issue here is $reason - it's not being stored anywhere during the loop, so just repeating the $reason from the last record in the data... but I'm not sure of the best way to do this? It's easy enough when just dealing with the first two fields, as you can use a simple hash, as is being done already.

I'm just not sure of the best way to add the extra info into this? Is a hash of hashes required to get the additional element in? I've not played with complex hashes before, so suggestions/examples would be much appreciated.

Comment on Best way to store/sum multiple-field records? Select or Download Code

Replies are listed 'Best First'.
Re: Best way to store/sum multiple-field records? by toolic (Bishop) on Dec 23, 2014 at 00:02 UTC
`use warnings; use strict; my %data; while (<DATA>) { chomp; my ($id, $sum, $reason) = split /\\|/; $data{$id}{sum} += $sum; $data{$id}{reason} = $reason; } use Data::Dumper; $Data::Dumper::Sortkeys=1; print Dumper(\%data); __DATA__ USERID1\|2215\|Jones,Tom\| USERID1\|1000\|Jones, Tom\| USERID3\|1495\|Dole, Bob\| USERID2\|2500\|Francis, Pope\| USERID2\|1500\|Francis, Pope\|` [download] perldsc	[reply] [d/l]
Re: Best way to store/sum multiple-field records? by GrandFather (Saint) on Dec 23, 2014 at 00:23 UTC
A nested hash structure is probably the best solution: #!/usr/bin/perl use strict; use warnings; my %sum; while (<DATA>) { chomp; my ($key, $value, $reason) = split(/\\|/); if (! defined $reason \|\| $value !~ /^\d+$/) { warn qq<dropped line: "$_"\n>; next; } $sum{$key}{value} += $value; push @{$sum{$key}{reason}}, $reason; } local $" = '/'; for my $key (keys %sum) { print "$key\|$sum{$key}{value}\|@{$sum{$key}{reason}}\n"; } __DATA__ USERID1\|2215\|Jones,Tom\| USERID1\|1000\|Jones, Tom\| broken line USERID3\|1495\|Dole, Bob\| USERID2\|2500\|Francis, Pope\| USERID2\|1500\|Francis, Pope\| [download] Prints: `dropped line: "broken line" USERID3\|1495\|Dole, Bob USERID1\|3215\|Jones,Tom/Jones, Tom USERID2\|4000\|Francis, Pope/Francis, Pope` [download] There are a few coding things to watch out for: Don't declare a bunch of variables in a list. That's often a code smell that says "just doing this to shut up strictures, but I really don't know why I need to". Instead declare variables where they are first needed. Ideally declare a variable where it is initialized. Don't initialize arrays and hashes with an empty list. They are empty out of the box. Avoid unless and until. If their expressions are anything other than completely trivial they become almost impossible to understand. Don't put multiple statements on a single line and don't put the block for a loop or if statement on the same line as the loop/if. Perl is the programming world's equivalent of English	[reply] [d/l] [select]
Re^2: Best way to store/sum multiple-field records? by bobdabuilda (Beadle) on Dec 23, 2014 at 00:35 UTC
Thanks GrandFather and toolic (I did reply to yours also, but no idea where the reply went... actually, now that I am about to post this, I suspect I may have only previewed my response to you and missed the "create" step). Both of those look good, and I should be able to work my way through them to sort out where I went wrong and help him get what he's after. Very much appreciated. Thanks also for the tips, GrandFather. One of the things I actually had in my vapourised response to toolic, was a query about the declaration of the variables, and the fact he'd moved them into the loop. I would have thought that declaring variables over and over would be less efficient than declaring them once at the start - that was my reasoning for declaring them before the loop, anyways... Obviously, from your comments, that's not the case. Will remember that for future, thanks.	[reply]
Re^3: Best way to store/sum multiple-field records? by GrandFather (Saint) on Dec 23, 2014 at 01:34 UTC
Never code for "efficiency". Instead code for clarity and maintainability. Compared with almost anything else your code does, declaring variables takes no time at all. Even if it took a huge amount of time like 1/1000 of a second, that is still tiny compared to the time it takes to read a line from disk and process it. And even if it represented a large portion of the time for each loop iteration, unless you are processing thousands of lines that overhead just isn't noticeable. In practice the overhead is likely to be much less than 1 millionth of a second and nothing to worry about ever. Just remember: premature optimization is the root of all evil. Perl is the programming world's equivalent of English	[reply]
Re^3: Best way to store/sum multiple-field records? by Anonymous Monk on Dec 23, 2014 at 01:36 UTC
I would have thought that declaring variables over and over would be less efficient than declaring them once at the start - that was my reasoning for declaring them before the loop, anyways... Not only 'premature optimization is the root of all evil'; not only such a microoptimization is completely meaningless; but it's actually the other way around... declaring variables inside a loop is quite a bit faster. I guess due to Perl's own optimizations... `use strict; use warnings; use Benchmark qw( cmpthese ); my @strings = qw( USERID1\|2215\|Jones\| USERID1\|1000\|Jones\| USERID3\|1495\|Dole\| USERID2\|2500\|Francis\| USERID2\|1500\|Francis\| ); cmpthese( 1_000_000, { outside => sub { my ( $x, $y, $z ); for (@strings) { ( $x, $y, $z ) = split /\\|/; } }, inside => sub { for (@strings) { my ( $x, $y, $z ) = split /\\|/; } } } );` [download] result: `Rate outside inside outside 109890/s -- -38% inside 176678/s 61% --` [download]	[reply] [d/l] [select]
Re^4: Best way to store/sum multiple-field records? by choroba (Cardinal) on Dec 23, 2014 at 02:10 UTC
Re^4: Best way to store/sum multiple-field records? by BrowserUk (Patriarch) on Dec 23, 2014 at 02:52 UTC
Re^5: Best way to store/sum multiple-field records? (carte blanche) by tye (Sage) on Dec 23, 2014 at 05:39 UTC
Some notes below your chosen depth have not been shown here
Re^5: Best way to store/sum multiple-field records? by GotToBTru (Prior) on Dec 23, 2014 at 21:42 UTC
Some notes below your chosen depth have not been shown here
Re^2: Best way to store/sum multiple-field records? by bobdabuilda (Beadle) on Dec 23, 2014 at 00:48 UTC
Actually, I just noticed the "double up" of the names in the output - would that best be counteracted by another "defined" check along the lines of: `if (! defined $sum{$key}{reason}) { push @{$sum{$key}{reason}}, $reason; }` [download]	[reply] [d/l]
Re^3: Best way to store/sum multiple-field records? by GrandFather (Saint) on Dec 23, 2014 at 01:37 UTC
Did you try it? Did it work? How do you expect "Jones,Tom/Jones, Tom" to be handled? Aside from the formatting issue, I'd be inclined to use a nested hash instead of an array if you want to suppress duplicates. Perl is the programming world's equivalent of English	[reply]
Re^4: Best way to store/sum multiple-field records? by bobdabuilda (Beadle) on Dec 23, 2014 at 04:29 UTC
Re^5: Best way to store/sum multiple-field records? by GrandFather (Saint) on Dec 23, 2014 at 05:26 UTC
Re^4: Best way to store/sum multiple-field records? by bobdabuilda (Beadle) on Dec 23, 2014 at 03:10 UTC
Re: Best way to store/sum multiple-field records? by Laurent_R (Canon) on Dec 23, 2014 at 00:52 UTC
The best way is definitely to use a hash of hashes, as shown by toolic. Nothing complicated in that case really try to use his proposal. Now, if you don't feel comfortable with nested data structures, you have the right to use two hashes, one for summing the values, and another one for the mapping of USERIDs to names. This is maybe slightly less efficient (at least in terms of coding and possibly storing performances), but do we care, provided that it does what you want? No, we usually don't. Perl has a tradition of being nice with people writing "baby Perl", i.e. using only a relatively small subset of the language. As chromatic says in the Modern Perl book, "this baby Perl is a term of endearment, because everyone begins as a novice." The Programming Perl (or "Camel book") reference book is on the same line. And this is also my personal experience: I am rather happy that I have no longer have the source of my first Perl programs, I would probably find them very badly written. Yet, they did what I wanted, and they did it quite efficiently.	[reply]