Summing repeated counts for items stored in separate file

dmorgo has asked for the wisdom of the Perl Monks concerning the following question:

I have two files, keys.txt and values.txt, in this form:

keys.txt

Joe
Bob
Sally
Bob
Fred
[download]

values.txt

5
3
7
1
1
[download]

There is a 1-to-1 mapping between lines in the first file and the second file. Let's say these are amounts of tips earned. If a name appears twice, the amounts accumulate. So these files indicate that Joe made $5, Bob made $4, Sally made $7, and Fred made $1.

In the real world there could be many millions of lines, but only a few hundred thousand keys (in this example, the names are the keys).

One can assume the two files are the same number of lines and are always in synch.

What is an elegant way to read these (big) files and print out the total amount, the average amount, the max amount, and the min amount for each key?

The obvious answer is to open each file and read the lines into a hash:

open(KEYS, "keys.txt") or die "error: $!";
open(VALS, "values.txt") or die "error: $!";
my ($key,$value);
my (%tips_by_name, %av_by_name, %max_by_name, %min_by_name);
while ($key=<KEYS>) {
    $value=<VALS>;
    chomp($key);
    chomp($value);
    $tips_by_name{$key} += $value;
    # code to do average, min, max here..
    # I'm not asking about that part, my question is
    # more about the way of reading files
}
close(VALS);
close(KEYS);
# code to print out the results here... again
# not part of my question at the moment.
[download]

Maybe a better way to do it would be to create an object for each key (the flyweight pattern?) but my question is as much about the way to read the files. Is there a more elegant way than this to read two files in lockstep? Is this a job for tie? (which I haven't used much if ever, so forgive me if that's a stupid question).

I know one answer would be to use the UNIX command line utility, paste, like this:

paste keys.txt values.txt > combined.txt
[download]

and then read the one file and do a split. Very simple. But I can't do that in this case and am looking for the best way to do it in Perl.

Comment on Summing repeated counts for items stored in separate file Select or Download Code

Replies are listed 'Best First'.
Re: Summing repeated counts for items stored in separate file by wind (Priest) on Jul 27, 2007 at 22:42 UTC
Put the data into a single hash structure, and then do the math for your formatting at the very end. Throw in a couple of warnings in case the file processing ends prematurely. Use List::Util for quick functions to manipulate a list of values. use List::Util qw(sum max min); use strict; my $keysfile = 'keys.txt'; my $valsfile = 'vals.txt'; open(KEYS, $keysfile) or die "Can't open $keysfile: $!"; open(VALS, $valsfile) or die "Can't open $valsfile: $!"; my %data; while (!eof(KEYS) && !eof(VALS)) { chomp(my $key = <KEYS>); chomp(my $val = <VALS>); push @{$data{$key}}, $val; } warn "Premature end of $keysfile\n" if !eof(VALS); warn "Premature end of $valsfile\n" if !eof(KEYS); foreach my $name (sort keys %data) { my $values = $data{$name}; my $count = @$values; my $total = sum(@$values); my $average = $total / $count; my $max = max(@$values); my $min = min(@$values); print "$name => total=$total, average=$average, max=$max, min=$min +\n"; } [download] - Miller	[reply] [d/l]
Re^2: Summing repeated counts for items stored in separate file by dmorgo (Pilgrim) on Jul 27, 2007 at 23:22 UTC
Thanks, that is a very nice solution. I didn't think of the idea of storing lists under each key then processing them afterward, and didn't know about the eof test. And pretty code too. Thanks again.	[reply]
Re: Summing repeated counts for items stored in separate file by gam3 (Curate) on Jul 28, 2007 at 13:22 UTC
Here is a really simple solution that will not use too much memory. use strict; use vars qw (%data); open(KEYS, "keys.txt") or die "error: keys.txt $!"; open(VALS, "values.txt") or die "error: values.txt $!"; while (my $key = readline(KEYS)) { my $val = readline(VALS); chomp $key; chomp $val; print "$key = $val\n"; unless (exists $data{$key}) { $data{$key}{min} = $data{$key}{max} = $data{$key}{total} = $va +l; $data{$key}{count} = 1; } else { $data{$key}{min} = $val if $data{$key}{min} > $val; $data{$key}{max} = $val if $data{$key}{max} < $val; $data{$key}{total} += $val; $data{$key}{count}++; } } use Data::Dumper; print Dumper \%data; [download] -- gam3 A picture is worth a thousand words, but takes 200K.	[reply] [d/l]
Re^2: Summing repeated counts for items stored in separate file by dmorgo (Pilgrim) on Jul 28, 2007 at 21:10 UTC
Thanks. Nice looking code. Why readline(*KEYS) instead of <KEYS> ?	[reply]
Re^3: Summing repeated counts for items stored in separate file by mjscott2702 (Pilgrim) on Jul 29, 2007 at 23:41 UTC
The readline function is what is used "behind the scenes" to implement the <> operator, and takes a typeglob, hence *KEYS instead of <KEYS>; Keeping track of the min, max, total and count (for average value) may be faster than using the List utilities described earlier, or at least should take less memory, since you only need to keep track of a few values for each key. Although the utilities in question are extremely fast, you might need an array of thousands of values for each key, growing in size as you process the files.	[reply]
Re^4: Summing repeated counts for items stored in separate file by dmorgo (Pilgrim) on Aug 07, 2007 at 01:54 UTC
Re: Summing repeated counts for items stored in separate file by snopal (Pilgrim) on Jul 27, 2007 at 22:44 UTC
I would suggest that your method as documented is both clear and concise. Remember that you are developing code that may need maintenance in months or years by you or someone else. You could play all sorts of obfuscation tricks that look more efficient, but the result will be inflexible code that is hard to read and maintain.	[reply]
Re: Summing repeated counts for items stored in separate file by wfsp (Abbot) on Jul 28, 2007 at 08:00 UTC
"...many millions of lines ... (big) files..." It may be worth considering a disk based hash like DBM::Deep.	[reply]
Re: Summing repeated counts for items stored in separate file by DigitalKitty (Parson) on Jul 28, 2007 at 03:54 UTC
Hi dmorgo. Perhaps a minor point but I felt as though declaring lexical variables in a 'collective fashion': `my ( $key,$value ); # or... my ( %tips_by_name, %av_by_name, %max_by_name, %min_by_name );` [download] might lead to a state of frustration at some point. If one wishes to initialize one of the variables listed, it would be inconvenient to do so. Instead, I would advocate structuring your code like this: `my $key = ''; my $value = ''; my %tips_by_name = ''; my %av_by_name = ''; my %max_by_name = ''; my %min_by_name = '';` [download] Initializing one or more of the variables listed above to another value would now be trivial. Hope this helps, ~Katie	[reply] [d/l] [select]
Re^2: Summing repeated counts for items stored in separate file by Anno (Deacon) on Jul 28, 2007 at 16:58 UTC
Why on earth would you do that? The scalars `$keys` and `$value` don't need initialization, but initializing to an empty string at least makes sense. Initializing a hash as in `my %tips_by_name = '';` [download] is, pardon the bluntness, plain nonsense. Under warnings, Perl warns about it. Anno	[reply] [d/l] [select]
Re^2: Summing repeated counts for items stored in separate file by mrpeabody (Friar) on Jul 28, 2007 at 04:41 UTC
Initializing a variable in dmorgo's code would also be trivial. Your suggestion forces the programmer to do a lot of extra typing up front in order to maybe save a small amount of typing later, which seems like a poor tradeoff. I would also argue that dmorgo's code has a clarity advantage in that it is immediately obvious which variables don't have default values and which do.	[reply]
Re: Summing repeated counts for items stored in separate file by dogz007 (Scribe) on Jul 30, 2007 at 08:23 UTC
Oh, but dmorgo, you can do a simple UNIX-like paste. First construct a `paste` subroutine similar to the one below: `sub paste { my @fh = @_; return sub { chomp( my @line = map { scalar(<$_>) } @fh ); return join ' ', @line; } }` [download] Notice that this sub returns another anonymous sub (called an iterator) that reads one line at a time from each of the input file handles and returns them "pasted" together, separated only by a little whitespace for easy splitting. To take advantage of this code, simply open your files and `paste` them together. `open my $keys, "keys.txt" or die "couldn't open keys.txt"; open my $vals, "vals.txt" or die "couldn't open vals.txt"; my $combined = paste($keys,$vals); print $combined->(), "\n" for 0..4;` [download] Using the example keys and vals given in your question, the output of this reads: `Joe 5 Bob 3 Sally 7 Bob 1 Fred 1` [download] If the deref syntax of `$combined->()` isn't quite to your taste, then whip up a little syntactic sugar to help the medicine go down: `sub NEXTLINE { $_[0]->() } print NEXTLINE($combined), "\n" for 0..4;` [download] There are two benefits to this method that I can think of immediately. One is that it allows you to use a sub similar to the UNIX paste command with which you are already familiar. The second is that it avoids both the generation of a potentially huge `%combined` hash and the generation of a potentially huge `combined.txt` file. The concept of an iterator is not my original idea, but I use it whenever possible to keep my code clean and orderly. See Mark Jason Dominus's excellent text Higher Order Perl for more tricks and details on iterators.	[reply] [d/l] [select]
Re: Summing repeated counts for items stored in separate file by CountZero (Bishop) on Jul 30, 2007 at 06:17 UTC
Another option would be to store all your data in a real database and use standard SQL-queries to get your info. For a once off job that would be overkill, but if the data gets updated regularly, it would be less work on the whole. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]