Re: Calculating the average on a timeslice of data
by Tanktalus (Canon) on Jul 06, 2011 at 16:08 UTC
It's too bad you don't have it in CSV format - you could just use DBD::CSV to get your average ;-)

    SELECT SUM(VALUE)/COUNT(*) FROM mytable WHERE DATE = ?
If you're feeling really chipper, you could put this data into a real database, and, again, the average will be trivial. While I'm mostly kidding with this one, it really depends on what else you're doing with those 40,000 lines - it may really be cheaper to put it in a database (even SQLite) and use SQL to get your information than to do it yourself. But that's only true if you have more than one query to make against it.
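For the record, the DBD::CSV version would look something like this (a sketch only: the file name, the header names, and the use of AVG() instead of SUM()/COUNT() are my assumptions, not your data):

use strict;
use warnings;
use DBI;

# Assumes the data lives in ./mytable.csv with a header row of
# "id,date,value"; DBD::CSV maps the table name to that file.
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir      => '.',
    f_ext      => '.csv',
    RaiseError => 1,
});
my ($avg) = $dbh->selectrow_array(
    'SELECT AVG(value) FROM mytable WHERE date = ?',
    undef, '0109',
);
print "average for 0109: $avg\n";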
However, if you want to approach it directly, I'm not sure why it isn't feasible to use if ($date eq '0109'). It seems perfectly feasible to me.
# assumes $fh is already open on your data file, and $desired_date
# holds the date you're after (e.g. '0109')
my ($total, $count);
while (<$fh>)
{
    # omitting any error checking here - you shouldn't omit it, though.
    my ($id, $date, $value) = split ' ';
    if ($date eq $desired_date)
    {
        $total += $value;
        ++$count;
    }
}
my $avg = $total / $count;
As for your idea to load everything into hashes, that's fine, too. The huge disadvantage is the amount of RAM you'll use. You'll spend a bunch of time populating the hash, too. If you're only querying one thing out of it, that's all wasted time and space. If you're making multiple queries in the same process, then you can see a speed benefit from not having to re-read the file every time. It can be faster than a database, but will likely use more RAM, and will need to re-parse the file every time you load your perl script, whereas a database would have indexes that would speed things up across multiple processes. So, again, it all depends on your usage.
Most likely, the above code that scans through the file with the if is more than sufficient.
Thanks for the reply and explanation. The reason I was thinking of using an if statement is because there are 120 desired dates that I need to calculate an average for, and I didn't want to include 120 if statements in my code.
Also, the file being read in may be updated each week with new data and new dates, so ideally I wouldn't have to go back into the code and add new if statements to get the averages for the new dates.
Why would you have multiple if's? You don't hardcode things that vary - that's what variable means. You need a way to figure out what date(s) you want to gather averages for, which may be "all of them", and then vary on that. For example, if you want the user to specify the date, just make it a command line parameter: avg_for.pl 0109. Then you just set my $desired_date = shift @ARGV;, and the code I posted would suffice.
If you want to produce an average for ALL dates, you can use a hash to accumulate it, and then loop through the keys for the output:
use strict;
use warnings;
use Text::Table;

my %totals;
while (<DATA>)
{
    my ($id, $date, $value) = split ' ';
    $totals{$date}{sum} += $value;
    ++$totals{$date}{count};
}

my $tbl = Text::Table->new('Date', \' | ', 'Average');
for my $date (sort keys %totals)
{
    $tbl->add($date, $totals{$date}{sum} / $totals{$date}{count});
}
print $tbl;
__END__
1 0109 4
1 0209 5
1 0309 8
2 0109 6
2 0209 6
And this produces output like this:
Date | Average
0109 | 5
0209 | 5.5
0309 | 8
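(A side note on the \' | ' argument in the constructor: Text::Table treats a reference to a scalar as a literal column separator rather than a column title, which is what draws the vertical bar between the two columns above.)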
Re: Calculating the average on a timeslice of data
by zek152 (Pilgrim) on Jul 06, 2011 at 15:40 UTC
Specify how much "a lot of data" is. Are we talking thousands of lines, or millions, or billions? Personally, I would split on whitespace, then do something like:
if ($line[1] eq "0109") { $sum += $line[2]; $n += 1; }
I ran this on a file with 1000000 lines and it took about 3 seconds to run. There were 83333 matches for that particular date.
The complexity is O(n) so it should scale linearly with increased input sizes.
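For context, here is that test as a complete loop (the filehandle and the hard-coded date are assumptions - substitute your own):

my ($sum, $n) = (0, 0);
while (my $line = <$fh>) {          # $fh: an open handle on the data file
    my @line = split ' ', $line;    # whitespace-split: id, date, value
    if ($line[1] eq "0109") { $sum += $line[2]; $n += 1; }
}
print "average: ", $sum / $n, "\n" if $n;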
There are about 40,000 lines with many unique date values. Do you think it makes more sense to read the values into hashes and use something like $hash{$id}{$date}?
Re: Calculating the average on a timeslice of data
by BrowserUk (Patriarch) on Jul 06, 2011 at 20:27 UTC
Not sure why people are suggesting databases for a task that can be done by a one-liner on a file of 132,000 lines in under 1 second?
[21:23:36.53] C:\test>perl -anlE"$s{$F[1]}+=$F[2];++$n{$F[1]}}{say $_,':',$s{$_}/$n{$_} for sort keys%s" junk.dat
0101:4.563
0102:4.602
0103:4.632
0104:4.557
0105:4.515
0106:4.605
0107:4.59
0108:4.501
0109:4.441
0110:4.439
0111:4.542
0201:4.611
0202:4.461
0203:4.627
0204:4.447
0205:4.537
0206:4.434
0207:4.421
0208:4.412
0209:4.58
0210:4.416
0211:4.431
0301:4.444
0302:4.73
0303:4.541
0304:4.564
0305:4.524
0306:4.596
0307:4.618
0308:4.352
0309:4.331
0310:4.489
0311:4.436
0401:4.6
0402:4.425
0403:4.455
0404:4.451
0405:4.482
0406:4.601
0407:4.677
0408:4.307
0409:4.59
0410:4.528
0411:4.366
0501:4.602
0502:4.471
0503:4.5
0504:4.431
0505:4.372
0506:4.543
0507:4.441
0508:4.499
0509:4.476
0510:4.512
0511:4.575
0601:4.425
0602:4.536
0603:4.522
0604:4.585
0605:4.495
0606:4.425
0607:4.595
0608:4.48
0609:4.553
0610:4.528
0611:4.578
0701:4.38
0702:4.648
0703:4.583
0704:4.409
0705:4.575
0706:4.423
0707:4.352
0708:4.599
0709:4.372
0710:4.564
0711:4.39
0801:4.408
0802:4.51
0803:4.52
0804:4.412
0805:4.581
0806:4.469
0807:4.614
0808:4.632
0809:4.387
0810:4.533
0811:4.403
0901:4.314
0902:4.612
0903:4.463
0904:4.481
0905:4.643
0906:4.454
0907:4.343
0908:4.459
0909:4.593
0910:4.527
0911:4.545
1001:4.655
1002:4.456
1003:4.585
1004:4.536
1005:4.577
1006:4.441
1007:4.648
1008:4.549
1009:4.464
1010:4.696
1011:4.493
1101:4.548
1102:4.534
1103:4.646
1104:4.522
1105:4.522
1106:4.549
1107:4.563
1108:4.439
1109:4.539
1110:4.497
1111:4.531
1201:4.486
1202:4.471
1203:4.54
1204:4.428
1205:4.517
1206:4.506
1207:4.413
1208:4.49
1209:4.418
1210:4.475
1211:4.483
Date:0
[21:23:37.63] C:\test>wc -l junk.dat
132001 junk.dat
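For anyone new to the flags: spelled out as a script, the one-liner is roughly equivalent to this (a sketch, not a literal deparse):

# -n wraps the code in a while (<>) loop; -a autosplits each line
# into @F; the }{ trick closes that loop so the final statement
# runs once, after all input has been read.
my (%s, %n);
while (<>) {
    my @F = split ' ';
    $s{$F[1]} += $F[2];   # running sum of values per date
    ++$n{$F[1]};          # count of rows per date
}
print $_, ':', $s{$_} / $n{$_}, "\n" for sort keys %s;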
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Calculating the average on a timeslice of data
by Limbic~Region (Chancellor) on Jul 06, 2011 at 23:52 UTC
perlbrother,
I agree with BrowserUk regarding the non-use of a database for this. Don't get me wrong, databases are great, but once you become experienced in Perl, this is something you could have solved faster than it took you to write your post. I will walk you through my solution.
First, I considered how to handle the fact that you are interested in only 120 dates. One way to do this would be to provide the desired dates as an input parameter - preferably in a file. The input file becomes another parameter.
Next, I considered how to calculate the average itself. Assuming the values won't result in integer overflow, the simplest method would be to track the sum and total entries. The average is just the quotient of the two. Here is the finished program (not tested).
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

my %opt;
get_args(\%opt);

my $desired = load_date_file($opt{d});

open(my $fh, '<', $opt{i}) or die "Unable to open '$opt{i}' for reading: $!";
while (<$fh>) {
    chomp;
    my ($id, $date, $val) = split ' ', $_, 3;
    next if ! defined $date || ! exists $desired->{$date};
    $desired->{$date}{sum} += $val;
    $desired->{$date}{cnt}++;
}

for my $date (sort keys %$desired) {
    my $sum = $desired->{$date}{sum} || 0;
    my $cnt = $desired->{$date}{cnt} || 0;
    my $avg = $cnt ? sprintf('%.2f', $sum / $cnt) : 0;
    print join(',', $date, $sum, $cnt, $avg), "\n";
}

sub load_date_file {
    my ($file) = @_;
    my %desired;
    open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";
    while (<$fh>) {
        chomp;
        if (! /^\d{4}$/) {
            warn "'$_' is not in MMYY format - skipping\n";
            next;
        }
        $desired{$_} = undef;
    }
    return \%desired;
}

sub get_args {
    my ($opt) = @_;
    my $Usage = qq{Usage: $0 -d <date_file> -i <input_file>
    -h : This help message
    -d : The (d)ate file
    -i : The (i)nput file
} . "\n";
    getopts('hd:i:', $opt) or die $Usage;
    die $Usage if $opt->{h} || ! defined $opt->{d} || ! defined $opt->{i};
}
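Assuming a date file with one MMYY date per line, you would run it something like perl avg_by_date.pl -d dates.txt -i data.txt (the file names here are hypothetical). It prints one CSV line per requested date: date,sum,count,average.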
Thanks everyone for the replies! And thank you Limbic~Region in particular. I will try to work my way through the code, and you're right, I am fairly new at this!
Re: Calculating the average on a timeslice of data
by sundialsvc4 (Abbot) on Jul 06, 2011 at 19:35 UTC
All things considered, you might wish to take a long, hard look at SQLite. This is, believe it or not, a public domain database system which stores everything in a single file and does not involve a server. So, you get the goodness of “a flat file,” and the power of SQL. You also have very nice interactive utilities available. I find myself using it almost exclusively these days, in situations where I would normally use a flat file, and I almost always find the project taking a positive shift into a new direction because I can push more of the hard work off to SQL queries.
The only thing to bear in mind when using it ... but it is a big thing ... is the importance of “transactions.” If you are updating or inserting something, you need to be doing it within a transaction.
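For instance, a bulk load wrapped in a single transaction might look like this (a sketch, assuming DBD::SQLite; the database file, table, and rows are hypothetical):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=data.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS readings (id INTEGER, date TEXT, value REAL)');

my @rows = ([1, '0109', 4], [1, '0209', 5], [2, '0109', 6]);

$dbh->begin_work;    # one transaction around all the inserts
my $sth = $dbh->prepare('INSERT INTO readings (id, date, value) VALUES (?, ?, ?)');
$sth->execute(@$_) for @rows;
$dbh->commit;        # without this, each insert pays for its own disk sync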