perlbrother has asked for the wisdom of the Perl Monks concerning the following question:

Good morning monks. I have data in the form of
ID   Date (MMYY)   Value
 1      0109         4
 1      0209         5
 1      0309         8
 .        .          .
 2      0109         6
 2      0209         6
 .        .          .
And so forth. I want to take the average of every value that matches a specific date - for example, what is the average value for each ID that matches the date 0109. I don't need an entire program written for me, just an explanation of how best to proceed. Also, there is a lot of data, so it's not feasible to use if ($date = 0109) { ... } for every date of interest. Thanks, monks!

Replies are listed 'Best First'.
Re: Calculating the average on a timeslice of data
by Tanktalus (Canon) on Jul 06, 2011 at 16:08 UTC

    It's too bad you don't have it in CSV format - you could just use DBD::CSV to get your average ;-)

        SELECT SUM(VALUE)/COUNT(*) FROM mytable WHERE DATE = ?
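
    For instance, a DBD::CSV version might look something like this (a sketch: the file name data.csv, its header row, and the column names id/date/value are all assumptions):

    use strict;
    use warnings;
    use DBI;

    # Treat the current directory as a database of CSV files;
    # the table "data" maps to the (assumed) file data.csv,
    # whose first row supplies the column names.
    my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
        f_dir      => '.',
        f_ext      => '.csv/r',
        RaiseError => 1,
    });

    my ($avg) = $dbh->selectrow_array(
        'SELECT SUM(value)/COUNT(*) FROM data WHERE date = ?',
        undef, '0109',
    );
    print "Average for 0109: $avg\n";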

    If you're feeling really chipper, you could put this data into a real database, and, again, the average will be trivial. While I'm mostly kidding with this one, it really depends on what else you're doing with those 40,000 lines - it may really be cheaper to put it in a database (even SQLite) and use SQL to get your information than to do it yourself. But that's only true if you have more than one query to make against it.

    However, if you want to approach it directly, I'm not sure why it isn't feasible to use if ($date eq '0109'). It seems perfectly feasible to me.

    my ($total, $count);
    while (<$fh>) {
        # omitting any error checking here - you shouldn't omit it, though.
        my ($id, $date, $value) = split ' ';
        if ($date eq $desired_date) {
            $total += $value;
            ++$count;
        }
    }
    my $avg = $total / $count;
    As for your idea to load everything into hashes, that's fine, too. The huge disadvantage is the amount of RAM you'll use. You'll spend a bunch of time populating the hash, too. If you're only querying one thing out of it, that's all wasted time and space. If you're making multiple queries in the same process, then you can see a speed benefit from not having to re-read the file every time. It can be faster than a database, but will likely use more RAM, and will need to re-parse the file every time you load your perl script, whereas a database would have indexes that would speed things up across multiple processes. So, again, it all depends on your usage.

    Most likely, the above code that scans through the file with the if is more than sufficient.

      Thanks for the reply and explanation. The reason I was thinking of using if statements is that there are 120 dates I need to calculate an average for, and I didn't want to put 120 if statements in my code. Also, the file being read in may be updated each week with new data and new dates, so ideally I wouldn't have to go back into the code and add new if statements to cover the new dates.

        Why would you have multiple if's? You don't hardcode things that vary - that's what variable means. You need a way to figure out what date(s) you want to gather averages for, which may be "all of them", and then vary on that. For example, if you want the user to specify the date, just make it a command line parameter: avg_for.pl 0109. Then you just set my $desired_date = shift @ARGV;, and the code I posted would suffice.
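
        Concretely, that might look like this (a sketch; taking the data file as a second argument, or on STDIN, is an assumption):

        #!/usr/bin/perl
        # Usage:  avg_for.pl 0109 data.txt   (or:  avg_for.pl 0109 < data.txt)
        use strict;
        use warnings;

        my $desired_date = shift @ARGV
            or die "Usage: $0 MMYY [file ...]\n";

        my ($total, $count) = (0, 0);
        while (<>) {   # reads the remaining @ARGV files, or STDIN
            my ($id, $date, $value) = split ' ';
            next unless defined $date && $date eq $desired_date;
            $total += $value;
            ++$count;
        }
        die "No records for $desired_date\n" unless $count;
        print "Average for $desired_date: ", $total / $count, "\n";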

        If you want to produce an average for ALL dates, you can use a hash to accumulate it, and then loop through the keys for the output:

        use strict;
        use warnings;
        use Text::Table;

        my %totals;
        while (<DATA>) {
            my ($id, $date, $value) = split ' ';
            $totals{$date}{sum} += $value;
            ++$totals{$date}{count};
        }

        my $tbl = Text::Table->new('Date', \' | ', 'Average');
        for my $date (sort keys %totals) {
            $tbl->add($date, $totals{$date}{sum} / $totals{$date}{count});
        }
        print $tbl;

        __END__
        1 0109 4
        1 0209 5
        1 0309 8
        2 0109 6
        2 0209 6
        And this produces output like this:
        Date | Average
        0109 | 5
        0209 | 5.5
        0309 | 8

Re: Calculating the average on a timeslice of data
by zek152 (Pilgrim) on Jul 06, 2011 at 15:40 UTC

    Specify how much "a lot of data" is. Are we talking thousands of lines or millions or billions? Personally I would split on white space, then do something like:

    my @line = split ' ', $_;
    if ($line[1] eq "0109") {   # eq for string comparison; == would numify "0109" to 109
        $sum += $line[2];
        $n += 1;
    }

    I ran this on a file with 1000000 lines and it took about 3 seconds to run. There were 83333 matches for that particular date.

    The complexity is O(n) so it should scale linearly with increased input sizes.
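
    For reference, a test file like that can be generated with a short script (a sketch; the 12 distinct dates and the value range are assumptions, chosen so each date collects roughly 1,000,000/12 ≈ 83,333 matches):

    use strict;
    use warnings;

    # Write 1,000,000 random "id date value" lines to test.dat.
    my @dates = map { sprintf '%02d09', $_ } 1 .. 12;   # 0109 .. 1209
    open my $out, '>', 'test.dat' or die "Cannot open test.dat: $!";
    for (1 .. 1_000_000) {
        my $id    = int(rand 100) + 1;
        my $date  = $dates[ rand @dates ];
        my $value = int(rand 10);
        print {$out} "$id $date $value\n";
    }
    close $out;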

      There are about 40,000 lines, with many unique date values. Do you think it makes more sense to read the values into hashes and use something like $hash{$id}{$date}?
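
      That nested-hash layout would look something like this (a sketch; the file name data.txt is an assumption):

      use strict;
      use warnings;

      open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";

      # Accumulate a running sum and count per (id, date) pair.
      my %hash;
      while (<$fh>) {
          my ($id, $date, $value) = split ' ';
          $hash{$id}{$date}{sum} += $value;
          $hash{$id}{$date}{cnt}++;
      }

      # Average for ID 1 on date 0109, if that pair occurred:
      if (my $slot = $hash{1}{'0109'}) {
          print $slot->{sum} / $slot->{cnt}, "\n";
      }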
Re: Calculating the average on a timeslice of data
by BrowserUk (Patriarch) on Jul 06, 2011 at 20:27 UTC

    Not sure why people are suggesting databases for a task that can be done by a one-liner, on a file of 132,000 lines, in under 1 second.

    [21:23:36.53] C:\test> perl -anlE"$s{$F[1]}+=$F[2];++$n{$F[1]}}{say $_,':',$s{$_}/$n{$_} for sort keys %s" junk.dat
    0101:4.563 0102:4.602 0103:4.632 0104:4.557 0105:4.515 0106:4.605 0107:4.59 0108:4.501 0109:4.441 0110:4.439 0111:4.542
    0201:4.611 0202:4.461 0203:4.627 0204:4.447 0205:4.537 0206:4.434 0207:4.421 0208:4.412 0209:4.58 0210:4.416 0211:4.431
    0301:4.444 0302:4.73 0303:4.541 0304:4.564 0305:4.524 0306:4.596 0307:4.618 0308:4.352 0309:4.331 0310:4.489 0311:4.436
    0401:4.6 0402:4.425 0403:4.455 0404:4.451 0405:4.482 0406:4.601 0407:4.677 0408:4.307 0409:4.59 0410:4.528 0411:4.366
    0501:4.602 0502:4.471 0503:4.5 0504:4.431 0505:4.372 0506:4.543 0507:4.441 0508:4.499 0509:4.476 0510:4.512 0511:4.575
    0601:4.425 0602:4.536 0603:4.522 0604:4.585 0605:4.495 0606:4.425 0607:4.595 0608:4.48 0609:4.553 0610:4.528 0611:4.578
    0701:4.38 0702:4.648 0703:4.583 0704:4.409 0705:4.575 0706:4.423 0707:4.352 0708:4.599 0709:4.372 0710:4.564 0711:4.39
    0801:4.408 0802:4.51 0803:4.52 0804:4.412 0805:4.581 0806:4.469 0807:4.614 0808:4.632 0809:4.387 0810:4.533 0811:4.403
    0901:4.314 0902:4.612 0903:4.463 0904:4.481 0905:4.643 0906:4.454 0907:4.343 0908:4.459 0909:4.593 0910:4.527 0911:4.545
    1001:4.655 1002:4.456 1003:4.585 1004:4.536 1005:4.577 1006:4.441 1007:4.648 1008:4.549 1009:4.464 1010:4.696 1011:4.493
    1101:4.548 1102:4.534 1103:4.646 1104:4.522 1105:4.522 1106:4.549 1107:4.563 1108:4.439 1109:4.539 1110:4.497 1111:4.531
    1201:4.486 1202:4.471 1203:4.54 1204:4.428 1205:4.517 1206:4.506 1207:4.413 1208:4.49 1209:4.418 1210:4.475 1211:4.483
    Date:0
    [21:23:37.63] C:\test> wc -l junk.dat
    132001 junk.dat
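
    Unpacked, that one-liner is roughly equivalent to the longhand below (a sketch; the trailing Date:0 in the output presumably comes from a header line in junk.dat being swept into the hash):

    # -a autosplits each input line into @F, -n supplies the read loop,
    # and the }{ "Eskimo" trick turns the last statement into end-of-input code.
    my (%s, %n);
    while (<>) {
        my @F = split ' ';
        $s{$F[1]} += $F[2];   # running sum of values per date
        ++$n{$F[1]};          # running count of records per date
    }
    print "$_:", $s{$_} / $n{$_}, "\n" for sort keys %s;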

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Calculating the average on a timeslice of data
by Limbic~Region (Chancellor) on Jul 06, 2011 at 23:52 UTC
    perlbrother,
    I agree with BrowserUk regarding the non-use of a database for this. Don't get me wrong, databases are great, but once you become experienced in Perl, this is something you could have solved in less time than it took to write your post. I will walk you through my solution.

    First, I considered how to handle the fact that you are interested in only 120 dates. One way to do this is to provide the desired dates in a file passed as a command-line parameter; the data file itself becomes a second parameter.

    Next, I considered how to calculate the average itself. Assuming the values won't result in integer overflow, the simplest method would be to track the sum and total entries. The average is just the quotient of the two. Here is the finished program (not tested).

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Getopt::Std;

    my %opt;
    get_args(\%opt);

    my $desired = load_date_file($opt{d});

    open(my $fh, '<', $opt{i}) or die "Unable to open '$opt{i}' for reading: $!";
    while (<$fh>) {
        chomp;
        my ($id, $date, $val) = split ' ', $_, 3;
        next if ! defined $date || ! exists $desired->{$date};
        $desired->{$date}{sum} += $val;
        $desired->{$date}{cnt}++;
    }

    for my $date (sort keys %$desired) {
        my $sum = $desired->{$date}{sum} || 0;
        my $cnt = $desired->{$date}{cnt} || 0;
        my $avg = $cnt ? sprintf('%.2f', $sum / $cnt) : 0;
        print join(',', $date, $sum, $cnt, $avg), "\n";
    }

    sub load_date_file {
        my ($file) = @_;
        my %desired;
        open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";
        while (<$fh>) {
            chomp;
            if (! /^\d{4}$/) {
                warn "'$_' is not in MMYY format - skipping\n";
                next;
            }
            $desired{$_} = undef;
        }
        return \%desired;
    }

    sub get_args {
        my ($opt) = @_;
        my $Usage = qq{Usage: $0 -d <date_file> -i <input_file>
        -h : This help message
        -d : The (d)ate file
        -i : The (i)nput file
    } . "\n";
        getopts('hd:i:', $opt) or die $Usage;
        die $Usage if $opt->{h} || ! defined $opt->{d} || ! defined $opt->{i};
    }
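
    Invocation might look like this (file names are assumptions; the data rows are the sample from the Text::Table reply above):

    $ cat dates.txt
    0109
    0209
    $ perl avg.pl -d dates.txt -i data.txt
    0109,10,2,5.00
    0209,11,2,5.50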

    Cheers - L~R

      Thanks everyone for the replies! And thank you Limbic Region in particular. I will try to work my way through the code, and you're right, I am fairly new at this!
Re: Calculating the average on a timeslice of data
by sundialsvc4 (Abbot) on Jul 06, 2011 at 19:35 UTC

    All things considered, you might wish to take a long, hard look at SQLite. This is, believe it or not, a public-domain database system which stores everything in a single file and does not involve a server. So, you get the goodness of “a flat file” and the power of SQL. You also have very nice interactive utilities available. I find myself using it almost exclusively these days, in situations where I would normally use a flat file, and I almost always find the project taking a positive shift in a new direction because I can push more of the hard work off to SQL queries.

    The only thing to bear in mind when using it ... but it is a big thing ... is the importance of “transactions.” If you are updating or inserting something, you need to be doing it within a transaction.
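
    A sketch of that route with DBD::SQLite (the file names, table name, and column names are all assumptions):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=averages.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS readings (id INTEGER, date TEXT, value REAL)');

    # Do the bulk insert inside a single transaction (the "big thing" above).
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT INTO readings (id, date, value) VALUES (?, ?, ?)');
    open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
    while (<$fh>) {
        my @fields = split ' ';
        next unless @fields == 3;   # skip blank or malformed lines
        $ins->execute(@fields);
    }
    $dbh->commit;

    # After loading, each average really is a single query away:
    my ($avg) = $dbh->selectrow_array(
        'SELECT AVG(value) FROM readings WHERE date = ?', undef, '0109');
    print "Average for 0109: $avg\n";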