Creating One Table From Several Hashes

bryced1234 has asked for the wisdom of the Perl Monks concerning the following question:

Hey All-

I want to create one large hash with similar keys that totals the values from a large data file that looks like this:

ZIP, Money, Date, ID
12345, 200, 11062000, C1234
12345, 50, 11062000, C1234
67890, 50, 11072000, D5555

The end product should look something like this:
ZIP, Total (11062000), Total (11072000), ID
12345, 250, 0, C1234
67890, 0, 50, D5555

Essentially, I need to aggregate the based on the ZIP and create new columns for each date.

Here is the code I have thus far:


#Read a file that has the dates and IDs I need...each line of the file
+ looks like...11062000C12345678

open(DATA1, "<", "dates.txt") or die $!;

while (<DATA1>) {

my $dateindex = substr $_, 0, 8;
my $candID = substr $_, 8, 9;

#Now that I have the first date/ID I will now start to aggregate the m
+oney by Zip Code...this requires opening a second file which is forma
+tted like this...ZIP/MONEY/DATE/ID

open(DATA2, "<", "indv2000.txt") or die $!;

my %hash;

while (<DATA2>) {

my $zip = substr $_, 82, 5;
my $money = substr $_, 130, 7;
my $date = substr $_, 122, 8;
my $cand1 = substr $_, 0, 9;

#IF the date and candidate ID is equal to the one fed by the first whi
+le loop...then aggregate the money by zip code

if ($date == $dateindex && $cand1 eq $candID){
                 
  $hash{$zip} += $money;
                                
             }
                    

      }

#NOW that I have the total money for zip code for the day and candidat
+e ID in question....I will print the hash separated by commas with an
+ indication of the date/candidate ID used....I will use the dos comma
+nd > to dump the output into a text file
      
 while ( ($k,$v) = each %hash ) {
 print "ZIP, MONEY, DATE, CANDID \n";
 print "$k,$v, $dateindex, $candID, HASH \n";
      
   }         
        
#repeat until the dates file is done...
            
}
[download]

The output from this code looks like this:
ZIP, MONEY, DATE, CANDID
38401,250, 11062000, C00003418
77024,200, 11062000, C00003418

ZIP, MONEY, DATE, CANDID
75711,1000, 11072000, C00003418
33480,5000, 11072000, C00003418

So forth and so on for every date and candidate. Does anyone have any suggestions about how to modify this code in order to get the output in this format:

ZIP, MONEY (11062000), MONEY (11072000), ...., CANDID
75711, 0, 1000, C00003418
33480, 0, 5000, C00003418

I would appreciate any help one can give.

Thanks!

Comment on Creating One Table From Several Hashes Download Code

Replies are listed 'Best First'.
Re: Creating One Table From Several Hashes by ikegami (Patriarch) on Feb 24, 2009 at 19:38 UTC
Your data: One table per $candid One row per $zip Three columns ($zip, $amt, $candid) One $amt subcolumn per $date Your data structure should reflect that. use strict; use warnings; my %data; my %dates; my %zips; while (<DATA>) { chomp; my ($zip, $amt, $date, $candid) = split; $zips {$zip } = 1; $dates{$date} = 1; $data{$candid}{$zip}{amts}{$date} += $amt; } my @candids = sort keys %data; my @dates = sort keys %dates; my @zips = sort keys %zips; local $, = "\t"; local $\ = "\n"; for my $candid (@candids) { my $table = $data{$candid}; print('CANDID', 'ZIP', (map "MONEY ($_)", @dates)); for my $zip (@zips) { my $row = $table->{$zip}; my @amts = map { $_ \|\| 0 } @{ $row->{amts} }{ @dates }; print($candid, $zip, @amts); } print(''); } __DATA__ 12345 200 11062000 C1234 12345 50 11062000 C1234 67890 50 11072000 D5555 38401 250 11062000 C00003418 77024 200 11062000 C00003418 75711 1000 11072000 C00003418 33480 5000 11072000 C00003418 [download] `CANDID ZIP MONEY (11062000) MONEY (11072000) C00003418 12345 0 0 C00003418 33480 0 5000 C00003418 38401 250 0 C00003418 67890 0 0 C00003418 75711 0 1000 C00003418 77024 200 0 CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0 CANDID ZIP MONEY (11062000) MONEY (11072000) D5555 12345 0 0 D5555 33480 0 0 D5555 38401 0 0 D5555 67890 0 50 D5555 75711 0 0 D5555 77024 0 0` [download] A minor change would cut out "blank" rows and columns if that's what you prefer. Specifically, collect a lists of zips and dates per cand instead one one global list.	[reply] [d/l] [select]
Re^2: Creating One Table From Several Hashes by bryced1234 (Initiate) on Feb 26, 2009 at 03:33 UTC
Thanks for the help. This does exactly what I needed. However, I think I was mistaken about the way I wanted the output structured. I am analyzing the FEC data set for individual campaign contributions, which has approximately 1.2 million entries. When I attempted to use your code to parse the file it became overwhelming very quickly. Given that, I am going to change the desired output table, to weeks instead of days. So it looks something like this: ZIP, WEEK1 (CANDID 1234), WEEK2 (CANDID 4567)... 64521, 500, 400, ... Where WEEK1 (CANDID 1234) = the total money donated to that candidate from each zip code for that week. I will toy around with this some this weekend and I may post again in the future. Thanks again for all the help!	[reply]
Re^3: Creating One Table From Several Hashes by ikegami (Patriarch) on Feb 26, 2009 at 03:39 UTC
Easy! After `my ($zip, $amt, $date, $candid) = split;` [download] just add `$date = days_to_week($date);` [download] Writing `days_to_week` is left as an exercise for the reader :) (I'd use DateTime.)	[reply] [d/l] [select]
Re^2: Creating One Table From Several Hashes by bryced1234 (Initiate) on Mar 05, 2009 at 13:19 UTC
Hey- I was wondering if I could get some additional advice. I have been working with your code, and I have ran into a slight problem. It seems as though this routine is creating a new date column for each zipcode entry instead of summing the amounts of money for each date and zipcode. The end results is a symmetric matrix, with the columns equaling the number of zipcodes. This is not what should happen. Instead, I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates. I have limited the data set to 4000 zipcodes and 60 days, so the matrix should be 4000 X 60, with each cell representing the total amount of money for that zipcode at that day. So (1,1) => Zipcode 12345 total contribution for 09012000, (1,2) => zipcode 12345 total contribution for 09022000, (2,1) => zipcode 12346 total contribution for 09012000, you get the idea. Instead what I get is a matrix 4000 X 4000, which seems to create new dates for each new zipcode entry. I think what is going on is a keying problem with the hash. Instead of recognizing 09012000 = 09012000 it assumes 09012000 != 09012000 and it is creating a new column 09012000.1 instead of grouping the dates together. Do you have any thoughts on this matter? Thanks.	[reply]
Re^3: Creating One Table From Several Hashes by ikegami (Patriarch) on Mar 05, 2009 at 20:36 UTC
I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates. That's what it does. From my earlier post, `CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0` [download] 6 zips, so 6 rows. 2 dates, so 2 cols. Note that 250 is the sum of two different amounts. Instead what I get is a matrix 4000 X 4000 Not using my code, you didn't. As for the problem with your code, I don't know what you're expecting from me without showing me the code.	[reply] [d/l]
Re^4: Creating One Table From Several Hashes by bryced1234 (Initiate) on Mar 06, 2009 at 13:28 UTC
Re: Creating One Table From Several Hashes by kennethk (Abbot) on Feb 24, 2009 at 19:42 UTC
I was writing up a bit more, but then noticed the number of posts that appeared. So just three quick notes: You have a CSV file, so rather than using substr with fixed-width columns, perhaps you should consider using Text::CSV. It will make you code more resilient and you'll spend less time thinking how to parse the file. Given you want to organize by an arbitrary number of dates by zip code, it seems like you should use a hash of hashes keyed on ZIP (as others have suggested) - see perllol for operational details. From the content of your files, it sounds like these are pulled from a database. Is there a reason you are using local files as an intermediary rather than DBI?	[reply]
Re^2: Creating One Table From Several Hashes by bryced1234 (Initiate) on Feb 26, 2009 at 03:38 UTC
Thanks for the info (DBI looks useful). I actually misspoke in my original post. The data that I am using is fixed-width, so substr is the way to go. Sorry for the confusion. I am a first time poster here. Next time I will be more precise with my language.	[reply]
Re: Creating One Table From Several Hashes by toolic (Bishop) on Feb 24, 2009 at 19:32 UTC
You could stuff all your data into a Hash-of-Hashes structure, then print out based on the dates you have. #!/usr/bin/env perl use strict; use warnings; my %data; while (<DATA>) { next if /ZIP/; chomp; my ($zip, $money, $date, $id) = split /\s,\s/; $data{$zip}{total} += $money; $data{$zip}{date} = $date; $data{$zip}{id} = $id; } my @dates; for my $zip (keys %data) { push @dates, $data{$zip}{date} if exists $data{$zip}{date}; } print "ZIP"; for (@dates) {print ",Total($_)"} print ",ID\n"; for my $zip (keys %data) { print "$zip, "; for (@dates) { if ($data{$zip}{date} eq $_) { print "$data{$zip}{total},"; } else { print "0,"; } } print "$data{$zip}{id}\n"; } __DATA__ ZIP, Money, Date, ID 12345, 200, 11062000, C1234 12345, 50, 11062000, C1234 67890, 50, 11072000, D5555 [download] prints: `ZIP,Total(11062000),Total(11072000),ID 12345, 250,0,C1234 67890, 0,50,D5555` [download] This matches your desired output, based on the input example.	[reply] [d/l] [select]
Re^2: Creating One Table From Several Hashes by bryced1234 (Initiate) on Feb 26, 2009 at 03:34 UTC
Thanks for the help! See my post below for an explanation of where I am at on this project. Thanks again.	[reply]
Re: Creating One Table From Several Hashes by dwm042 (Priest) on Feb 24, 2009 at 19:36 UTC
Since you're only storing data by zip code you're having to dump your data the moment the date and candID change. I'd use a HoHoH myself. single entries would look something like this: `$hash{$zip}{$ID}{total}` for the monetary total regardless of dates. `$hash{$zip}{$ID}{$date}` for the monetary total for any individual date. Unique dates would be stored as keys in the hash, and you could fetch them with `keys $hash{$zip}{$ID}`.	[reply]