in reply to Creating One Table From Several Hashes

Your data:

Your data structure should reflect that.

use strict; use warnings; my %data; my %dates; my %zips; while (<DATA>) { chomp; my ($zip, $amt, $date, $candid) = split; $zips {$zip } = 1; $dates{$date} = 1; $data{$candid}{$zip}{amts}{$date} += $amt; } my @candids = sort keys %data; my @dates = sort keys %dates; my @zips = sort keys %zips; local $, = "\t"; local $\ = "\n"; for my $candid (@candids) { my $table = $data{$candid}; print('CANDID', 'ZIP', (map "MONEY ($_)", @dates)); for my $zip (@zips) { my $row = $table->{$zip}; my @amts = map { $_ || 0 } @{ $row->{amts} }{ @dates }; print($candid, $zip, @amts); } print(''); } __DATA__ 12345 200 11062000 C1234 12345 50 11062000 C1234 67890 50 11072000 D5555 38401 250 11062000 C00003418 77024 200 11062000 C00003418 75711 1000 11072000 C00003418 33480 5000 11072000 C00003418
CANDID ZIP MONEY (11062000) MONEY (11072000) C00003418 12345 0 0 C00003418 33480 0 5000 C00003418 38401 250 0 C00003418 67890 0 0 C00003418 75711 0 1000 C00003418 77024 200 0 CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0 CANDID ZIP MONEY (11062000) MONEY (11072000) D5555 12345 0 0 D5555 33480 0 0 D5555 38401 0 0 D5555 67890 0 50 D5555 75711 0 0 D5555 77024 0 0

A minor change would cut out "blank" rows and columns if that's what you prefer. Specifically, collect a lists of zips and dates per cand instead one one global list.

Replies are listed 'Best First'.
Re^2: Creating One Table From Several Hashes
by bryced1234 (Initiate) on Feb 26, 2009 at 03:33 UTC
    Thanks for the help. This does exactly what I needed. However, I think I was mistaken about the way I wanted the output structured. I am analyzing the FEC data set for individual campaign contributions, which has approximately 1.2 million entries. When I attempted to use your code to parse the file it became overwhelming very quickly. Given that, I am going to change the desired output table, to weeks instead of days.

    So it looks something like this:

    ZIP, WEEK1 (CANDID 1234), WEEK2 (CANDID 4567)...
    64521, 500, 400, ...

    Where WEEK1 (CANDID 1234) = the total money donated to that
    candidate from each zip code for that week. I will toy around with this some this weekend and I may post again in the future.

    Thanks again for all the help!

      Easy! After
      my ($zip, $amt, $date, $candid) = split;
      just add
      $date = days_to_week($date);

      Writing days_to_week is left as an exercise for the reader :) (I'd use DateTime.)

Re^2: Creating One Table From Several Hashes
by bryced1234 (Initiate) on Mar 05, 2009 at 13:19 UTC
    Hey-

    I was wondering if I could get some additional advice. I have been working with your code, and I have ran into a slight problem. It seems as though this routine is creating a new date column for each zipcode entry instead of summing the amounts of money for each date and zipcode. The end results is a symmetric matrix, with the columns equaling the number of zipcodes. This is not what should happen. Instead, I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates. I have limited the data set to 4000 zipcodes and 60 days, so the matrix should be 4000 X 60, with each cell representing the total amount of money for that zipcode at that day. So (1,1) => Zipcode 12345 total contribution for 09012000, (1,2) => zipcode 12345 total contribution for 09022000, (2,1) => zipcode 12346 total contribution for 09012000, you get the idea. Instead what I get is a matrix 4000 X 4000, which seems to create new dates for each new zipcode entry. I think what is going on is a keying problem with the hash. Instead of recognizing 09012000 = 09012000 it assumes 09012000 != 09012000 and it is creating a new column 09012000.1 instead of grouping the dates together. Do you have any thoughts on this matter? Thanks.

      I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates.

      That's what it does. From my earlier post,

      CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0

      6 zips, so 6 rows.
      2 dates, so 2 cols.

      Note that 250 is the sum of two different amounts.

      Instead what I get is a matrix 4000 X 4000

      Not using my code, you didn't. As for the problem with your code, I don't know what you're expecting from me without showing me the code.

        You were right.

        It wasn't a code issue, it was a data issue. For some reason when I filtered the data in order to exclude all but a few dates it didn't work. I forgot to check it before running the script...lesson learned. I figured out the filter issue, and now everything is copacetic. Thanks again for the help, and sorry for not including my code. I was sort of in a scramble mode after a very long day of working on this project.