bryced1234 has asked for the wisdom of the Perl Monks concerning the following question:

Hey All-

I want to create one large hash with similar keys that totals the values from a large data file that looks like this:

ZIP, Money, Date, ID
12345, 200, 11062000, C1234
12345, 50, 11062000, C1234
67890, 50, 11072000, D5555

The end product should look something like this:
ZIP, Total (11062000), Total (11072000), ID
12345, 250, 0, C1234
67890, 0, 50, D5555

Essentially, I need to aggregate the based on the ZIP and create new columns for each date.


Here is the code I have thus far:
#Read a file that has the dates and IDs I need...each line of the file + looks like...11062000C12345678 open(DATA1, "<", "dates.txt") or die $!; while (<DATA1>) { my $dateindex = substr $_, 0, 8; my $candID = substr $_, 8, 9; #Now that I have the first date/ID I will now start to aggregate the m +oney by Zip Code...this requires opening a second file which is forma +tted like this...ZIP/MONEY/DATE/ID open(DATA2, "<", "indv2000.txt") or die $!; my %hash; while (<DATA2>) { my $zip = substr $_, 82, 5; my $money = substr $_, 130, 7; my $date = substr $_, 122, 8; my $cand1 = substr $_, 0, 9; #IF the date and candidate ID is equal to the one fed by the first whi +le loop...then aggregate the money by zip code if ($date == $dateindex && $cand1 eq $candID){ $hash{$zip} += $money; } } #NOW that I have the total money for zip code for the day and candidat +e ID in question....I will print the hash separated by commas with an + indication of the date/candidate ID used....I will use the dos comma +nd > to dump the output into a text file while ( ($k,$v) = each %hash ) { print "ZIP, MONEY, DATE, CANDID \n"; print "$k,$v, $dateindex, $candID, HASH \n"; } #repeat until the dates file is done... }

The output from this code looks like this:
ZIP, MONEY, DATE, CANDID
38401,250, 11062000, C00003418
77024,200, 11062000, C00003418

ZIP, MONEY, DATE, CANDID
75711,1000, 11072000, C00003418
33480,5000, 11072000, C00003418

So forth and so on for every date and candidate. Does anyone have any suggestions about how to modify this code in order to get the output in this format:

ZIP, MONEY (11062000), MONEY (11072000), ...., CANDID
75711, 0, 1000, C00003418
33480, 0, 5000, C00003418

I would appreciate any help one can give.

Thanks!

Replies are listed 'Best First'.
Re: Creating One Table From Several Hashes
by ikegami (Patriarch) on Feb 24, 2009 at 19:38 UTC
    Your data:
    • One table per $candid
    • One row per $zip
    • Three columns ($zip, $amt, $candid)
    • One $amt subcolumn per $date

    Your data structure should reflect that.

    use strict; use warnings; my %data; my %dates; my %zips; while (<DATA>) { chomp; my ($zip, $amt, $date, $candid) = split; $zips {$zip } = 1; $dates{$date} = 1; $data{$candid}{$zip}{amts}{$date} += $amt; } my @candids = sort keys %data; my @dates = sort keys %dates; my @zips = sort keys %zips; local $, = "\t"; local $\ = "\n"; for my $candid (@candids) { my $table = $data{$candid}; print('CANDID', 'ZIP', (map "MONEY ($_)", @dates)); for my $zip (@zips) { my $row = $table->{$zip}; my @amts = map { $_ || 0 } @{ $row->{amts} }{ @dates }; print($candid, $zip, @amts); } print(''); } __DATA__ 12345 200 11062000 C1234 12345 50 11062000 C1234 67890 50 11072000 D5555 38401 250 11062000 C00003418 77024 200 11062000 C00003418 75711 1000 11072000 C00003418 33480 5000 11072000 C00003418
    CANDID ZIP MONEY (11062000) MONEY (11072000) C00003418 12345 0 0 C00003418 33480 0 5000 C00003418 38401 250 0 C00003418 67890 0 0 C00003418 75711 0 1000 C00003418 77024 200 0 CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0 CANDID ZIP MONEY (11062000) MONEY (11072000) D5555 12345 0 0 D5555 33480 0 0 D5555 38401 0 0 D5555 67890 0 50 D5555 75711 0 0 D5555 77024 0 0

    A minor change would cut out "blank" rows and columns if that's what you prefer. Specifically, collect a lists of zips and dates per cand instead one one global list.

      Thanks for the help. This does exactly what I needed. However, I think I was mistaken about the way I wanted the output structured. I am analyzing the FEC data set for individual campaign contributions, which has approximately 1.2 million entries. When I attempted to use your code to parse the file it became overwhelming very quickly. Given that, I am going to change the desired output table, to weeks instead of days.

      So it looks something like this:

      ZIP, WEEK1 (CANDID 1234), WEEK2 (CANDID 4567)...
      64521, 500, 400, ...

      Where WEEK1 (CANDID 1234) = the total money donated to that
      candidate from each zip code for that week. I will toy around with this some this weekend and I may post again in the future.

      Thanks again for all the help!

        Easy! After
        my ($zip, $amt, $date, $candid) = split;
        just add
        $date = days_to_week($date);

        Writing days_to_week is left as an exercise for the reader :) (I'd use DateTime.)

      Hey-

      I was wondering if I could get some additional advice. I have been working with your code, and I have ran into a slight problem. It seems as though this routine is creating a new date column for each zipcode entry instead of summing the amounts of money for each date and zipcode. The end results is a symmetric matrix, with the columns equaling the number of zipcodes. This is not what should happen. Instead, I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates. I have limited the data set to 4000 zipcodes and 60 days, so the matrix should be 4000 X 60, with each cell representing the total amount of money for that zipcode at that day. So (1,1) => Zipcode 12345 total contribution for 09012000, (1,2) => zipcode 12345 total contribution for 09022000, (2,1) => zipcode 12346 total contribution for 09012000, you get the idea. Instead what I get is a matrix 4000 X 4000, which seems to create new dates for each new zipcode entry. I think what is going on is a keying problem with the hash. Instead of recognizing 09012000 = 09012000 it assumes 09012000 != 09012000 and it is creating a new column 09012000.1 instead of grouping the dates together. Do you have any thoughts on this matter? Thanks.

        I was envisioning a matrix with the rows equaling the number of zip codes and the columns equaling the number of dates.

        That's what it does. From my earlier post,

        CANDID ZIP MONEY (11062000) MONEY (11072000) C1234 12345 250 0 C1234 33480 0 0 C1234 38401 0 0 C1234 67890 0 0 C1234 75711 0 0 C1234 77024 0 0

        6 zips, so 6 rows.
        2 dates, so 2 cols.

        Note that 250 is the sum of two different amounts.

        Instead what I get is a matrix 4000 X 4000

        Not using my code, you didn't. As for the problem with your code, I don't know what you're expecting from me without showing me the code.

Re: Creating One Table From Several Hashes
by kennethk (Abbot) on Feb 24, 2009 at 19:42 UTC

    I was writing up a bit more, but then noticed the number of posts that appeared. So just three quick notes:

    1. You have a CSV file, so rather than using substr with fixed-width columns, perhaps you should consider using Text::CSV. It will make you code more resilient and you'll spend less time thinking how to parse the file.
    2. Given you want to organize by an arbitrary number of dates by zip code, it seems like you should use a hash of hashes keyed on ZIP (as others have suggested) - see perllol for operational details.
    3. From the content of your files, it sounds like these are pulled from a database. Is there a reason you are using local files as an intermediary rather than DBI?
      Thanks for the info (DBI looks useful). I actually misspoke in my original post. The data that I am using is fixed-width, so substr is the way to go. Sorry for the confusion. I am a first time poster here. Next time I will be more precise with my language.
Re: Creating One Table From Several Hashes
by toolic (Bishop) on Feb 24, 2009 at 19:32 UTC
    You could stuff all your data into a Hash-of-Hashes structure, then print out based on the dates you have.
    #!/usr/bin/env perl use strict; use warnings; my %data; while (<DATA>) { next if /ZIP/; chomp; my ($zip, $money, $date, $id) = split /\s*,\s*/; $data{$zip}{total} += $money; $data{$zip}{date} = $date; $data{$zip}{id} = $id; } my @dates; for my $zip (keys %data) { push @dates, $data{$zip}{date} if exists $data{$zip}{date}; } print "ZIP"; for (@dates) {print ",Total($_)"} print ",ID\n"; for my $zip (keys %data) { print "$zip, "; for (@dates) { if ($data{$zip}{date} eq $_) { print "$data{$zip}{total},"; } else { print "0,"; } } print "$data{$zip}{id}\n"; } __DATA__ ZIP, Money, Date, ID 12345, 200, 11062000, C1234 12345, 50, 11062000, C1234 67890, 50, 11072000, D5555

    prints:

    ZIP,Total(11062000),Total(11072000),ID 12345, 250,0,C1234 67890, 0,50,D5555

    This matches your desired output, based on the input example.

      Thanks for the help! See my post below for an explanation of where I am at on this project. Thanks again.
Re: Creating One Table From Several Hashes
by dwm042 (Priest) on Feb 24, 2009 at 19:36 UTC
    Since you're only storing data by zip code you're having to dump your data the moment the date and candID change. I'd use a HoHoH myself.

    single entries would look something like this:

    $hash{$zip}{$ID}{total} for the monetary total regardless of dates.
    $hash{$zip}{$ID}{$date} for the monetary total for any individual date.

    Unique dates would be stored as keys in the hash, and you could fetch them with keys $hash{$zip}{$ID}.