Perl hit counter for multiple ranges, optimization or options in other languages?

SaktiMonk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by rjt (Curate) on Aug 07, 2013 at 22:41 UTC
You're asking for a frequency distribution. Statistics::Descriptive makes very quick work of this (and much more): `use Statistics::Descriptive; my @ranges = (19, 29, 39); # 1..19, 20..29, 30..39 my $stat = Statistics::Descriptive::Full->new; $stat->add_data( 12, 14, 15, 20, 21 ); my %dist = $stat->frequency_distribution(\@ranges); printf " %5d: %4d\n", $_, $dist{$_} for @ranges;` [download] Output: `19: 3 29: 2 39: 0` [download] `use strict; use warnings;` omitted for brevity.	[reply] [d/l] [select]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 07, 2013 at 22:43 UTC
I didn't know about this module, but sounds like a great alternative for my problem, thanks!	[reply]
Re^3: Perl hit counter for multiple ranges, optimization or options in other languages? by locked_user sundialsvc4 (Abbot) on Aug 07, 2013 at 23:48 UTC
Definitely get to know this site: http://search.cpan.org. You will be using it quite constantly ...
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 10, 2013 at 01:25 UTC
worked out great, now the script runs much more smoothly, thanks!	[reply]
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by Eily (Monsignor) on Aug 07, 2013 at 22:23 UTC
You could go faster by completing your rangeHash straightaway without going through dataHash first. Assuming your ranges can't overlap, all you have to do is compute the range from the value and use it as a key. For example: (the getRange sub is far from being perfect, I went for a simple solution since you probably already have code to check which range you are in) : `use Data::Dumper; sub getRange { my $value = shift; warn "Unexpected value $value," and return if $value < 1; return '1-19' if $value < 20; return '20-29' if $value < 30; return '30-39' if $value < 40; return '40-59' if $value < 60; warn "Out of range value $value," and return; } my %hits; while (<DATA>) { my $range = getRange($_); $hits{$range}++ if defined $range; } print Dumper \%hits; __DATA__ 12 14 15 20 21 50 42` [download] The result is : `$VAR1 = { '1-19' => 3, '40-59' => 2, '20-29' => 2 };` [download]	[reply] [d/l] [select]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 07, 2013 at 22:42 UTC
Another good one! Thanks, will test this out!	[reply]
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by kcott (Archbishop) on Aug 08, 2013 at 06:42 UTC
G'day SaktiMonk, Welcome to the monastery. Update: My original solution (now within `<spoiler>` tags below) didn't take into account uneven sized ranges. I didn't notice that but ++Laurent_R did (see below). Here's a new version that still uses an array (`@bins`) for the range counts; caches the results of determining the range; and still only uses one iteration per range to output the bin counts. `$ perl -Mstrict -Mwarnings -e ' use List::Util qw{first}; my @input = qw{12 14 15 20 21}; my @range_ends = qw{19 29 39}; my (@bins, %range_index_cache); sub get_range_index { ( first { $_[0] <= $range_ends[$_] } 0 .. $#range_ends ) // @r +ange_ends; } for (@input) { ++$bins[$range_index_cache{$_} //= get_range_index($_)]; } for (0 .. $#range_ends) { printf "%d-%d %d\n" => $_ ? $range_ends[$_ - 1] + 1 : 1, $range_ends[$_], $bins[$ +_] // 0; } ' 1-19 3 20-29 2 30-39 0` [download] Any datapoints beyond the ranges you specify are counted in the last element of `@bins`. <Reveal this spoiler or all in this thread> [Note:* If you're using an old version of Perl that doesn't have the `//` (Logical Defined-Or) operator, you'll need to use the defined function.]* -- Ken	[reply] [d/l] [select]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by Laurent_R (Canon) on Aug 08, 2013 at 13:16 UTC
I also thought about this possible solution, which would be very fast (and easy), but that works only if the bins have an equal size making it possible to compute the right bin by a mathematical formula. But, from the OP's sample data, the bins don't have all the same size.	[reply]
Re^3: Perl hit counter for multiple ranges, optimization or options in other languages? by kcott (Archbishop) on Aug 09, 2013 at 03:33 UTC
++ Thanks for spotting that. I've updated my node. -- Ken	[reply]
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by hippo (Archbishop) on Aug 07, 2013 at 21:56 UTC
Your description suggests that your data are all integers - is this the case? If so there will be a lot of duplicate data points so you can first of all tot up all the same integers into a hash. Only once you have done that should you think about putting them into the bins. You should then sort the keys of your totals hash and the keys of your bins hash so you only have to step through each hash once. You could do this in any language, but perl will probably be as good as most. Have a look in your favourite algorithms book as this is not an uncommon sort of task and will be covered there in detail.	[reply]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 07, 2013 at 22:42 UTC
Sounds good, I'll try this one and see what happens	[reply]
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by Laurent_R (Canon) on Aug 08, 2013 at 10:51 UTC
The data sample that you provided appears to be sorted in ascending order. If it is really the case of your data, then this provides a possible opportunity for vast performance improvements, which you waste if you store the data in a temporary hash (unsorted). The temporary hash is useless anyway and is a waste of time. Assuming you are first storing your ranges in the form of upper bounds in a sorted array (@ranges), the loop could be very very fast, because you don't need to test each of the 50,000+ ranges, but just one for each input line. Try something like this (untested program, I can't test right now): `my %frequencies; my $overlap; foreach my $upper (@ranges) { if (defined $overlap) { next unless $overlap <= $upper; $frequencies{$upper} ++; # $overlap is within the range } while (<$IN_FILE>) { chomp; if ($_ <= $upper) { $frequencies{$upper} ++; } else { $overlap = $_; last; } } }` [download] Even if your input file is not sorted, it is very likely that it will be faster to first sort your input and then use the algorithm above than having to check 50,000 ranges for each line (although a dichotomic search would of course reduce this burden). Update Apr. 8, 16:45 UTC: corrected a small error in the program.	[reply] [d/l]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by MidLifeXis (Monsignor) on Aug 08, 2013 at 13:31 UTC
~~If you use a slightly different data structure, you could even use a modified binary search to find the proper bucket, resulting in log2(N) comparisons instead of N/2 comparisons (on average).~~ Update: Ignore this - Laurent_R's solution assumes an already sorted data set and just steps through it. `$coffee++`. --MidLifeXis	[reply] [d/l]
Re^3: Perl hit counter for multiple ranges, optimization or options in other languages? by Laurent_R (Canon) on Aug 08, 2013 at 14:44 UTC
The algorithm I suggested is actually reading in parallel the data and the bin ranges. Therefore, it does just 1 comparison per line of data (plus one additional comparison each time the program crosses a bin range). The complexity is O(N). It will be very difficult to do anything better for N data records (at least if the data is sorted, which was my starting hypothesis; even if the data needs to be sorted, it is still likely to be the fastest solution if a good N Log N sorting program is used).	[reply]
Re^4: Perl hit counter for multiple ranges, optimization or options in other languages? by MidLifeXis (Monsignor) on Aug 08, 2013 at 14:48 UTC
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 08, 2013 at 17:19 UTC
Looks great, on my way of testing this solution out!!! And yes, unfortunately my bins don't have equal sizes, so this makes the thing more challenging. Thanks!	[reply]
Re: Perl hit counter for multiple ranges, optimization or options in other languages? by choroba (Cardinal) on Aug 07, 2013 at 23:11 UTC
Crossposted at StackOverflow. Note that it is considered polite to inform about crossposting so people not attending both sites do not waste their efforts solving a problem already sorted out at the other end of the Internets. لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by locked_user sundialsvc4 (Abbot) on Aug 07, 2013 at 23:54 UTC
I think that the key is “inform about crossposting ...” since many of us frequent many sites. Also, if you do find a particularly-useful reply on another site, I also like it when people mention (and summarize) the relevant posts here along with a cross-link to their sources. (The summary being intended to maintain the information-value here no matter what “those other sites” might do, say, five or ten years from now ... if they still exist `;-)` ... and if, ugh, I “still exist.”)
Re^3: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 08, 2013 at 16:34 UTC
Got it, and will keep it in mind next time I ask something :) Keeping good etiquette is always a plus, thanks!	[reply]
Re^2: Perl hit counter for multiple ranges, optimization or options in other languages? by SaktiMonk (Initiate) on Aug 07, 2013 at 23:46 UTC
Indeed, and I'm sorry about that, however I got extra replies that could be as/more useful from this site, thanks PerlMonks!	[reply]