nadocane has asked for the wisdom of the Perl Monks concerning the following question:

Fellows, I am a graduate student who has been away from Perl for 3 or 4 years, and am in need of some assistance. I've got a flat-file (text) database that I need to count the # of tornado days (there is a numeric date field in MM/DD/YYYY format) for each state (a numeric field with each state having a unique ID from 1 to 50)...but the data is arranged by tornado report (such that there are many cases where there are more than 1 tornado per day per state)...so I am looking to count just unique records (1 per day per state) for each year of record, display the results. If anyone could give me any tips on how to get started on this task, it would be much appreciated. Have a good evening, Gordon
  • Comment on Help with calculating stats from a flat-file database

Replies are listed 'Best First'.
Re: Help with calculating stats from a flat-file database
by kvale (Monsignor) on Oct 08, 2004 at 03:55 UTC
    The first thing you have to do is discover the database format. Is it CSV? Then use Text::xSV to help parse it. Is it fixed format? Then use the unpack function to split the data. I have seen meterological information in both formats.

    Once you have the format and can convince yourself that you are parsing the data properly, I would use a hash to count tornado days. Something like this:

    while (<>) { # parse line my ($date, $state) = ... $tornadoes{$state}{$date}++; } foreach my $state (1..50) { print "State = $state, tronadoes = ", scalar keys %{$tornadoes{$ +state}}, "\n"; }
    If your dates span more than a year, you will have to add extra logic to parse and filter dates.

    -Mark

Re: Help with calculating stats from a flat-file database
by ikegami (Patriarch) on Oct 08, 2004 at 03:56 UTC
    @STATE_LOOKUP = ( undef, # There's no state 0 'Maine', 'Hawaii', 'Canada', 'New York', ... ); my $data; open(FILE, ...) or die(...); while (defined($line = <FILE>)) { my ($state_id, $date) = get_fields($line); my $date_n = convert_date_to_number($date); $data->[$state_id]->{$date}++; } close(FILE); { # Sample output. my ($state_id, $date_n); for ($state_id=1; $state_id<=50; $state_id++) { my $state = $STATE_LOOKUP[$state_id]; foreach $date_n (sort { $a <=> $b } keys %{$data->[$state]}) { my $date = convert_date_to_string($date_n); my $num_tornadoes = $data->[$state]{$date_n}; printf("State %s had %d tornado(es) on %s\n", $state, $num_tornadoes, $date, ) if $num_tornadoes; } } }

    get_fields depends on the format of your data.

    Look into the Time modules for help implementing convert_date_to_number.

    Others will probably use a hash for states without thinking, but an array would be more efficient and just as powerful (assuming you have reports for many states).

Re: Help with calculating stats from a flat-file database
by davido (Cardinal) on Oct 08, 2004 at 04:00 UTC

    As you read though the file, construct a datastructure like this:

    %events = ( 1 => { '10/27/2002' => 1, '09/15/2004' => 2, '03/24/2003' => 1 }, 2 => { '02/17/2001' => 1, '12/05/2003' => 2 }, 3 => ..... );

    That way, as you iterate through the flat file, you'll use state ID to get to the top-level of the hash of hashes, and dates will become the keys for the 2nd level. You don't really care about an event count per date, but as long as you're using hash keys to guarantee unique dates within each state, there's no reason not to use a ++ to increment the event count for that particular date.

    Now, you can count event dates in state 4 like this:

    my $date_count = keys %{$events{4}};

    It's all about the right datastructure. ...having said that, someone undoubtedly will come up with an even better one. ;)

    On the other hand, if you plan to search by date more often than by state, reverse the order so that the top level key is a date, the second level is state.

    Update: Just saw the comment that "others will use a hash without thinking." I did think about it, and decided to use a hash because not all 50 states have tornados (ie, a sparse array, which is a good use of a hash). I don't think you'll ever see one in Hawaii, nor in Washington or Vermont. ...at least maybe not within a narrow timeframe of a few years. Also I chose the hash approach on the off chance that as the script gets refined, state numbers might end up becoming state names or state abbreviations, which lend themselves better to hash keys. After discussing it in CB I have to admit that an array will give a more efficient by-state lookup, while the hash will make it easier to sum up the number of states that had an event. Maybe it's a tossup. ;)


    Dave

Re: Help with calculating stats from a flat-file database
by graff (Chancellor) on Oct 08, 2004 at 04:11 UTC
    If I understand the description, you want a boolean value for each date+state tuple -- that is: "there (was | was not) at least one tornado on date X in state Y".

    If that is the case, think about creating an array of 50 elements (one for each state). The value of each element is a reference to a hash, where the keys of the hash are the dates when one or more tornados occurred in the given state (the value assigned to each element is of no importance: it is the existence of a given date as a hash key for the given state that matters). So, suppose the data is like this:

    State Date other stuff 1 03/03/2003 one bad mother of a storm... 5 03/05/2003 nothing to write home about... 8 02/08/2003 they said it could never happen here ...
    You might try something like this, just to list how many tornado days there were in each state, based on the given input report:
    my @states; # array holding one hash per state while (<>) { my ($state_number,$date) = split; $states[$state_number]{$date} = undef; } # now to summarize: # for each state that had any tornados at all, # list number of tornado days: for ( 1 .. 50 ) { if ( defined $states[$_] ) { printf( "state_id %2d: %3d tornado days\n", $_, scalar keys %{$states[$_]} ); } }
    (untested, of course)
Re: Help with calculating stats from a flat-file database
by TedPride (Priest) on Oct 08, 2004 at 06:11 UTC
    Assuming your file format is tab delimited with the fields mm/dd/yy, state num, everything else:
    foreach (<MYHANDLE>) { $_ =~ /^(\d+)\/(\d+)\/(\d+)\t(\d+)\t/; if (!$h{"$1$2$3$4"}) { $h{"$1$2$3$4"} = 1; $c{$3}{$4}++; } } foreach $year (sort keys %c) { print "$year\n"; $p = $c{$year}; foreach (sort {$a <=> $b} keys %$p) { print "$_ " . $c{$year}{$_} . "\n"; } print "\n"; }
    MYHANDLE can be replaced with the handle of your flat file after you open it.