Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by rjt (Curate) on Aug 07, 2013 at 22:41 UTC
use Statistics::Descriptive;
my @ranges = (19, 29, 39); # 1..19, 20..29, 30..39
my $stat = Statistics::Descriptive::Full->new;
$stat->add_data( 12, 14, 15, 20, 21 );
my %dist = $stat->frequency_distribution(\@ranges);
printf " %5d: %4d\n", $_, $dist{$_} for @ranges;
Output:
19: 3
29: 2
39: 0
use strict; use warnings; omitted for brevity.
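If you'd rather label each bin with its full range instead of just the upper bound, one possible variation on the same call might look like this (just a sketch; the labelling loop is my own, not part of the module):
use strict;
use warnings;
use Statistics::Descriptive;

my @ranges = (19, 29, 39);                 # 1..19, 20..29, 30..39
my $stat   = Statistics::Descriptive::Full->new;
$stat->add_data( 12, 14, 15, 20, 21 );
my %dist = $stat->frequency_distribution(\@ranges);

# Print each bin as "lower-upper" instead of just its upper bound.
my $lower = 1;
for my $upper (@ranges) {
    printf " %7s: %4d\n", "$lower-$upper", $dist{$upper};
    $lower = $upper + 1;
}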
I didn't know about this module, but it sounds like a great alternative for my problem, thanks!
Worked out great; now the script runs much more smoothly, thanks!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by Eily (Monsignor) on Aug 07, 2013 at 22:23 UTC
You could go faster by filling your rangeHash straight away, without going through dataHash first.
Assuming your ranges can't overlap, all you have to do is compute the range from the value and use it as a key. For example (the getRange sub is far from perfect; I went for a simple solution, since you probably already have code to check which range a value falls in):
use Data::Dumper;

sub getRange
{
    my $value = shift;
    warn "Unexpected value $value," and return if $value < 1;
    return '1-19'  if $value < 20;
    return '20-29' if $value < 30;
    return '30-39' if $value < 40;
    return '40-59' if $value < 60;
    warn "Out of range value $value," and return;
}

my %hits;
while (<DATA>)
{
    my $range = getRange($_);
    $hits{$range}++ if defined $range;
}
print Dumper \%hits;
__DATA__
12
14
15
20
21
50
42
The result is:
$VAR1 = {
'1-19' => 3,
'40-59' => 2,
'20-29' => 2
};
Another good one! Thanks, will test this out!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by kcott (Archbishop) on Aug 08, 2013 at 06:42 UTC
G'day SaktiMonk,
Welcome to the monastery.
Update:
My original solution (now within <spoiler> tags below) didn't take into account unevenly sized ranges.
I didn't notice that, but ++Laurent_R did (see below).
Here's a new version that still uses an array (@bins) for the range counts; caches the results of determining the range; and still only uses one iteration per range to output the bin counts.
$ perl -Mstrict -Mwarnings -e '
    use List::Util qw{first};
    my @input = qw{12 14 15 20 21};
    my @range_ends = qw{19 29 39};
    my (@bins, %range_index_cache);
    sub get_range_index {
        ( first { $_[0] <= $range_ends[$_] } 0 .. $#range_ends ) // @range_ends;
    }
    for (@input) {
        ++$bins[$range_index_cache{$_} //= get_range_index($_)];
    }
    for (0 .. $#range_ends) {
        printf "%d-%d %d\n" =>
            $_ ? $range_ends[$_ - 1] + 1 : 1, $range_ends[$_], $bins[$_] // 0;
    }
'
1-19 3
20-29 2
30-39 0
Any datapoints beyond the ranges you specify are counted in the last element of @bins.
Here's a solution that only has one iteration per datapoint to populate the bins; and one iteration per range to output the bin counts.
$ perl -Mstrict -Mwarnings -e '
    my @input = qw{12 14 15 20 21};
    my $ranges = 3;
    my @bins;
    for (@input) {
        ++$bins[$_ / 20];
    }
    for (0 .. $ranges - 1) {
        printf "%d-%d %d\n" => $_ * 20 || 1, $_ * 20 + 19, $bins[$_] // 0;
    }
'
1-19 3
20-39 2
40-59 0
[Note: If you're using an old version of Perl that doesn't have the // (Logical Defined-Or) operator, you'll need to use the defined function.]
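For what it's worth, here's a minimal, self-contained sketch of how those two spots might be spelled out on a pre-5.10 Perl (the int($_ / 10) binning is only a stand-in for illustration, not the range logic used above):
#!/usr/bin/perl
use strict;
use warnings;

my @input = (12, 14, 15, 20, 21);
my (@bins, %cache);

for (@input) {
    # $cache{$_} //= int($_ / 10) becomes:
    $cache{$_} = int($_ / 10) unless defined $cache{$_};
    ++$bins[$cache{$_}];
}

for my $i (0 .. $#bins) {
    # $bins[$i] // 0 becomes:
    my $count = defined $bins[$i] ? $bins[$i] : 0;
    print "bin $i: $count\n";
}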
I also thought about this possible solution, which would be very fast (and easy), but it only works if the bins all have the same size, making it possible to compute the right bin with a simple formula. From the OP's sample data, though, the bins don't all have the same size.
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by hippo (Archbishop) on Aug 07, 2013 at 21:56 UTC
Your description suggests that your data are all integers - is this the case? If so, there will be a lot of duplicate data points, so you can first tot up all the identical integers into a hash. Only once you have done that should you think about putting them into the bins.
You should then sort the keys of your totals hash and the keys of your bins hash so you only have to step through each hash once, as sketched below.
You could do this in any language, but Perl will probably be as good as most. Have a look in your favourite algorithms book, as this is not an uncommon sort of task and will be covered there in detail.
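A minimal sketch of that idea, assuming sorted bin upper bounds and one integer per line (the variable names and the DATA section are just for illustration):
#!/usr/bin/perl
use strict;
use warnings;

my @range_ends = (19, 29, 39);    # sorted upper bound of each bin

# Pass 1: tot up identical integers so each distinct value is binned only once.
my %totals;
while (my $line = <DATA>) {
    chomp $line;
    $totals{$line}++;
}

# Pass 2: step through the sorted distinct values and the sorted bounds
# in parallel, so each is visited exactly once.
my %bins;
my $i = 0;
for my $value (sort { $a <=> $b } keys %totals) {
    $i++ while $i < @range_ends && $value > $range_ends[$i];
    last if $i >= @range_ends;    # value lies beyond the last bin
    $bins{ $range_ends[$i] } += $totals{$value};
}

printf "up to %d: %d\n", $_, $bins{$_} // 0 for @range_ends;

__DATA__
12
14
15
20
21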
Sounds good, I'll try this one and see what happens.
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by Laurent_R (Canon) on Aug 08, 2013 at 10:51 UTC
The data sample that you provided appears to be sorted in ascending order. If that is really the case for your data, it opens an opportunity for vast performance improvements, which you waste if you store the data in a temporary hash (which is unordered). The temporary hash is useless anyway and a waste of time.
Assuming you first store your ranges as upper bounds in a sorted array (@ranges), the loop can be very fast, because you don't need to test each of the 50,000+ ranges, but just one for each input line. Try something like this (untested program, I can't test right now):
my %frequencies;
my $overlap;
foreach my $upper (@ranges) {
    if (defined $overlap) {
        next unless $overlap <= $upper;
        $frequencies{$upper}++;    # $overlap is within the range
        undef $overlap;            # reset it so it is not counted again in later ranges
    }
    while (<$IN_FILE>) {
        chomp;
        if ($_ <= $upper) {
            $frequencies{$upper}++;
        } else {
            $overlap = $_;
            last;
        }
    }
}
Even if your input file is not sorted, it is very likely to be faster to sort your input first and then use the algorithm above than to check 50,000 ranges for each line (although a dichotomic search would of course reduce that burden), as sketched below.
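For example, if the whole data set fits in memory, the pre-sort might look like this (a sketch only; 'data.txt' is a hypothetical file name):
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example: read the unsorted data, sort it numerically,
# then walk @sorted in parallel with the sorted @ranges exactly as in
# the loop above (taking values from @sorted instead of <$IN_FILE>).
open my $IN_FILE, '<', 'data.txt' or die "Cannot open data.txt: $!";
chomp( my @data = <$IN_FILE> );
close $IN_FILE;

my @sorted = sort { $a <=> $b } @data;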
Update Aug. 8, 16:45 UTC: corrected a small error in the program.
If you use a slightly different data structure, you could even use a modified binary search to find the proper bucket, resulting in log2(N) comparisons instead of N/2 comparisons (on average).
Update: Ignore this - Laurent_R's solution assumes an already sorted data set and just steps through it. $coffee++.
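Still, for anyone who does end up with unsorted data and wants to place one value at a time, a minimal sketch of that bucket lookup via binary search over the sorted upper bounds might look like this (the sub and variable names are made up for illustration):
#!/usr/bin/perl
use strict;
use warnings;

my @range_ends = (19, 29, 39);    # sorted upper bound of each bucket

# Return the index of the first upper bound >= $value,
# or nothing if the value lies beyond the last bucket.
sub bucket_index {
    my ($value) = @_;
    return if $value > $range_ends[-1];
    my ($lo, $hi) = (0, $#range_ends);
    while ($lo < $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if ($value <= $range_ends[$mid]) { $hi = $mid }
        else                             { $lo = $mid + 1 }
    }
    return $lo;
}

my %hits;
for my $value (12, 14, 15, 20, 21, 42) {
    my $i = bucket_index($value);
    $hits{ $range_ends[$i] }++ if defined $i;
}
printf "up to %d: %d\n", $_, $hits{$_} // 0 for @range_ends;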
The algorithm I suggested actually reads the data and the bin ranges in parallel. Therefore, it does just one comparison per line of data (plus one additional comparison each time the program crosses a bin boundary). The complexity is O(N). It will be very difficult to do better for N data records (at least if the data is sorted, which was my starting hypothesis; even if the data needs to be sorted first, this is still likely to be the fastest solution if a good N log N sort is used).
Looks great, I'm on my way to testing this solution out! And yes, unfortunately my bins don't have equal sizes, which makes things more challenging. Thanks!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by choroba (Cardinal) on Aug 07, 2013 at 23:11 UTC
Crossposted at StackOverflow. Note that it is considered polite to mention crossposting, so that people who don't frequent both sites do not waste their effort solving a problem already sorted out at the other end of the Internets.
I think the key is to “inform about crossposting ...”, since many of us frequent many sites. Also, if you do find a particularly useful reply on another site, I like it when people mention (and summarize) the relevant posts here along with a cross-link to their sources. (The summary is intended to maintain the information value here no matter what “those other sites” might do, say, five or ten years from now ... if they still exist ;-) ... and if, ugh, I “still exist.”)
Got it, and will keep it in mind next time I ask something :) Keeping good etiquette is always a plus, thanks!
Indeed, and I'm sorry about that; however, I got extra replies from this site that could be just as useful or more so, thanks PerlMonks!