in reply to Perl hit counter for multiple ranges, optimization or options in other languages?

The data sample that you provided appears to be sorted in ascending order. If that is really the case for your data, then it opens an opportunity for vast performance improvements, which you waste if you store the data in a temporary (unsorted) hash. The temporary hash is useless anyway and a waste of time.

Assuming you first store your ranges as upper bounds in a sorted array (@ranges), the loop can be very fast, because you don't need to test each of the 50,000+ ranges, but just one per input line. Try something like this (untested program, I can't test right now):

my %frequencies;
my $overlap;
foreach my $upper (@ranges) {
    if (defined $overlap) {
        next unless $overlap <= $upper;
        $frequencies{$upper}++;   # $overlap is within this range
        undef $overlap;           # don't count it again in the next bin
    }
    while (<$IN_FILE>) {
        chomp;
        if ($_ <= $upper) {
            $frequencies{$upper}++;
        }
        else {
            $overlap = $_;        # belongs to a later bin; remember it
            last;
        }
    }
}
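In case it helps, here is a self-contained rendering of the same one-pass idea, working on in-memory arrays instead of a file handle so it can be run directly (the sample data and upper bounds are made-up values, and count_into_bins is a name I chose for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many sorted data points fall into each bin, where a bin is
# identified by its upper bound. Both arrays must be sorted ascending.
# Each array is scanned exactly once, so the whole pass is O(N + M).
sub count_into_bins {
    my ($data, $uppers) = @_;    # array refs, both sorted ascending
    my %frequencies;
    my $i = 0;                   # current position in @$data
    for my $upper (@$uppers) {
        $frequencies{$upper} //= 0;
        while ($i < @$data and $data->[$i] <= $upper) {
            $frequencies{$upper}++;
            $i++;
        }
    }
    return \%frequencies;
}

# Hypothetical sample: bins with upper bounds 10, 20, 30.
my $freq = count_into_bins([3, 7, 12, 14, 29], [10, 20, 30]);
print "$_ => $freq->{$_}\n" for sort { $a <=> $b } keys %$freq;
# prints: 10 => 2, 20 => 2, 30 => 1
```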

Even if your input file is not sorted, it is very likely faster to sort the input first and then use the algorithm above than to check 50,000 ranges for each line (although a dichotomic search would of course reduce that burden).
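For completeness, a dichotomic (binary) search over the sorted upper bounds would look roughly like this sketch (find_bucket and the sample bounds are names and values I made up for illustration; it takes about log2(N) comparisons per lookup instead of scanning the whole range list):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the index of the first upper bound >= $value in a sorted array
# of bin upper bounds, or scalar(@$uppers) if $value is past the last bin.
sub find_bucket {
    my ($uppers, $value) = @_;
    my ($lo, $hi) = (0, scalar @$uppers);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($uppers->[$mid] < $value) {
            $lo = $mid + 1;      # bucket must lie to the right
        }
        else {
            $hi = $mid;          # $mid could still be the answer
        }
    }
    return $lo;
}

my @uppers = (10, 20, 30);
print find_bucket(\@uppers, 15), "\n";   # prints 1 (the bin with bound 20)
```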

Update Apr. 8, 16:45 UTC: corrected a small error in the program.


Re^2: Perl hit counter for multiple ranges, optimization or options in other languages?
by MidLifeXis (Monsignor) on Aug 08, 2013 at 13:31 UTC

    If you use a slightly different data structure, you could even use a modified binary search to find the proper bucket, resulting in log2(N) comparisons instead of N/2 comparisons (on average).

    Update: Ignore this - Laurent_R's solution assumes an already sorted data set and just steps through it. $coffee++.

    --MidLifeXis

      The algorithm I suggested actually reads the data and the bin ranges in parallel. Therefore, it does just one comparison per line of data (plus one additional comparison each time the program crosses a bin boundary). The complexity is O(N). It will be very difficult to do better for N data records (at least if the data is sorted, which was my starting hypothesis; even if the data needs to be sorted first, this is still likely to be the fastest solution if a good N log N sort is used).

        Quite right. Coffee deficiency.

        --MidLifeXis

Re^2: Perl hit counter for multiple ranges, optimization or options in other languages?
by SaktiMonk (Initiate) on Aug 08, 2013 at 17:19 UTC
    Looks great, I'm on my way to testing this solution out! And yes, unfortunately my bins don't have equal sizes, which makes things more challenging. Thanks!