SaktiMonk has asked for the wisdom of the Perl Monks concerning the following question:

Oh great Monks, I'm coming for your advice.

I have written a perl script that counts the number of hits of specific numbers into user-defined bins. For example, this is my data file:

12
14
15
20
21

And I want to know how many hits I have in the following ranges:

1-19
20-29
30-39

So results would be like

1-19 3
20-29 2
30-39 0

I have done such a thing by fist saving my data into a hash (datahash), then saving my ranges into another hash (rangehash), and then basically going over all the data points in datahash and checking that the value falls within the ranges of the rangehash.

The problem is that for each datapoint in datahash, I loop through all the rangehash values and exit once I find the range where the datapoint falls. This is good for few data points, but now I'm having files with at least 2 million datapoints and 50,000 ranges, so looping through all of that just takes forever.

I was wondering if anyone would have better (more optimal) solution rather than just looping through the whole thing. Suggestions for other languages are well received!!!

Best,

Sakti
  • Comment on Perl hit counter for multiple ranges, optimization or options in other languages?

Replies are listed 'Best First'.
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by rjt (Curate) on Aug 07, 2013 at 22:41 UTC

    You're asking for a frequency distribution. Statistics::Descriptive makes very quick work of this (and much more):

    use Statistics::Descriptive; my @ranges = (19, 29, 39); # 1..19, 20..29, 30..39 my $stat = Statistics::Descriptive::Full->new; $stat->add_data( 12, 14, 15, 20, 21 ); my %dist = $stat->frequency_distribution(\@ranges); printf " %5d: %4d\n", $_, $dist{$_} for @ranges;

    Output:

    19: 3 29: 2 39: 0
    use strict; use warnings; omitted for brevity.
      I didn't know about this module, but sounds like a great alternative for my problem, thanks!
      worked out great, now the script runs much more smoothly, thanks!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by Eily (Monsignor) on Aug 07, 2013 at 22:23 UTC

    You could go faster by completing your rangeHash straightaway without going through dataHash first.

    Assuming your ranges can't overlap, all you have to do is compute the range from the value and use it as a key. For example: (the getRange sub is far from being perfect, I went for a simple solution since you probably already have code to check which range you are in) :

    use Data::Dumper; sub getRange { my $value = shift; warn "Unexpected value $value," and return if $value < 1; return '1-19' if $value < 20; return '20-29' if $value < 30; return '30-39' if $value < 40; return '40-59' if $value < 60; warn "Out of range value $value," and return; } my %hits; while (<DATA>) { my $range = getRange($_); $hits{$range}++ if defined $range; } print Dumper \%hits; __DATA__ 12 14 15 20 21 50 42
    The result is :
    $VAR1 = { '1-19' => 3, '40-59' => 2, '20-29' => 2 };

      Another good one! Thanks, will test this out!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by kcott (Archbishop) on Aug 08, 2013 at 06:42 UTC

    G'day SaktiMonk,

    Welcome to the monastery.

    Update: My original solution (now within <spoiler> tags below) didn't take into account uneven sized ranges. I didn't notice that but ++Laurent_R did (see below). Here's a new version that still uses an array (@bins) for the range counts; caches the results of determining the range; and still only uses one iteration per range to output the bin counts.

    $ perl -Mstrict -Mwarnings -e ' use List::Util qw{first}; my @input = qw{12 14 15 20 21}; my @range_ends = qw{19 29 39}; my (@bins, %range_index_cache); sub get_range_index { ( first { $_[0] <= $range_ends[$_] } 0 .. $#range_ends ) // @r +ange_ends; } for (@input) { ++$bins[$range_index_cache{$_} //= get_range_index($_)]; } for (0 .. $#range_ends) { printf "%d-%d %d\n" => $_ ? $range_ends[$_ - 1] + 1 : 1, $range_ends[$_], $bins[$ +_] // 0; } ' 1-19 3 20-29 2 30-39 0

    Any datapoints beyond the ranges you specify are counted in the last element of @bins.

    [Note: If you're using an old version of Perl that doesn't have the // (Logical Defined-Or) operator, you'll need to use the defined function.]

    -- Ken

      I also thought about this possible solution, which would be very fast (and easy), but that works only if the bins have an equal size making it possible to compute the right bin by a mathematical formula. But, from the OP's sample data, the bins don't have all the same size.

        ++ Thanks for spotting that. I've updated my node.

        -- Ken

Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by hippo (Archbishop) on Aug 07, 2013 at 21:56 UTC

    Your description suggests that your data are all integers - is this the case? If so there will be a lot of duplicate data points so you can first of all tot up all the same integers into a hash. Only once you have done that should you think about putting them into the bins.

    You should then sort the keys of your totals hash and the keys of your bins hash so you only have to step through each hash once.

    You could do this in any language, but perl will probably be as good as most. Have a look in your favourite algorithms book as this is not an uncommon sort of task and will be covered there in detail.

      Sounds good, I'll try this one and see what happens
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by Laurent_R (Canon) on Aug 08, 2013 at 10:51 UTC

    The data sample that you provided appears to be sorted in ascending order. If it is really the case of your data, then this provides a possible opportunity for vast performance improvements, which you waste if you store the data in a temporary hash (unsorted). The temporary hash is useless anyway and is a waste of time.

    Assuming you are first storing your ranges in the form of upper bounds in a sorted array (@ranges), the loop could be very very fast, because you don't need to test each of the 50,000+ ranges, but just one for each input line. Try something like this (untested program, I can't test right now):

    my %frequencies; my $overlap; foreach my $upper (@ranges) { if (defined $overlap) { next unless $overlap <= $upper; $frequencies{$upper} ++; # $overlap is within the range } while (<$IN_FILE>) { chomp; if ($_ <= $upper) { $frequencies{$upper} ++; } else { $overlap = $_; last; } } }

    Even if your input file is not sorted, it is very likely that it will be faster to first sort your input and then use the algorithm above than having to check 50,000 ranges for each line (although a dichotomic search would of course reduce this burden).

    Update Apr. 8, 16:45 UTC: corrected a small error in the program.

      If you use a slightly different data structure, you could even use a modified binary search to find the proper bucket, resulting in log2(N) comparisons instead of N/2 comparisons (on average).

      Update: Ignore this - Laurent_R's solution assumes an already sorted data set and just steps through it. $coffee++.

      --MidLifeXis

        The algorithm I suggested is actually reading in parallel the data and the bin ranges. Therefore, it does just 1 comparison per line of data (plus one additional comparison each time the program crosses a bin range). The complexity is O(N). It will be very difficult to do anything better for N data records (at least if the data is sorted, which was my starting hypothesis; even if the data needs to be sorted, it is still likely to be the fastest solution if a good N Log N sorting program is used).

      Looks great, on my way of testing this solution out!!! And yes, unfortunately my bins don't have equal sizes, so this makes the thing more challenging. Thanks!
Re: Perl hit counter for multiple ranges, optimization or options in other languages?
by choroba (Cardinal) on Aug 07, 2013 at 23:11 UTC
    Crossposted at StackOverflow. Note that it is considered polite to inform about crossposting so people not attending both sites do not waste their efforts solving a problem already sorted out at the other end of the Internets.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      I think that the key is “inform about crossposting ...” since many of us frequent many sites.   Also, if you do find a particularly-useful reply on another site, I also like it when people mention (and summarize) the relevant posts here along with a cross-link to their sources.   (The summary being intended to maintain the information-value here no matter what “those other sites” might do, say, five or ten years from now ... if they still exist ;-) ... and if, ugh, I “still exist.”)

        Got it, and will keep it in mind next time I ask something :) Keeping good etiquette is always a plus, thanks!
      Indeed, and I'm sorry about that, however I got extra replies that could be as/more useful from this site, thanks PerlMonks!