in reply to Re: Optimizing a CHI-based data throttler
in thread Optimizing a CHI-based data throttler
Now, performance obviously depends on the cost of CHI's get/set operations, and I don't know what that is (it's not something you can tell from a quick skim of the documentation).
But one obvious thing is that Data::Throttler uses the notion of buckets, which are counters for a time range, and from what I can gather it stores those buckets in an array. You (via CHI, that is) use a hashtable (I guess?). So there is definitely a penalty there in amortized performance, but it definitely does not account for the order of magnitude you observe. Still, the advantage of Data::Throttler is that it uses data structures and algorithms built for this specific use-case (summing buckets). It pays to be specific.
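To make the bucket idea concrete, here is a minimal sketch (in Python, for brevity) of a fixed array of counters, one per time slice. All names here (`Buckets`, `hit`, `count`) are my own for illustration, not Data::Throttler's actual API:

```python
import time

class Buckets:
    """A ring of counters, one per time slice of the window."""
    def __init__(self, interval, nbuckets):
        self.resolution = interval / nbuckets   # seconds covered by one bucket
        self.counts = [0] * nbuckets            # the contiguous array
        self.stamps = [None] * nbuckets         # which time slice each slot holds

    def _index(self, now):
        # Map a timestamp straight to an array slot -- O(1), no hashing.
        slice_ = int(now // self.resolution)
        i = slice_ % len(self.counts)
        if self.stamps[i] != slice_:            # slot holds an expired slice: recycle it
            self.stamps[i] = slice_
            self.counts[i] = 0
        return i

    def hit(self, now=None):
        self.counts[self._index(now if now is not None else time.time())] += 1

    def count(self, now=None):
        now = now if now is not None else time.time()
        self._index(now)                        # recycle the current slot if stale
        current = int(now // self.resolution)
        # Sum only buckets whose slice is still inside the window.
        return sum(c for c, s in zip(self.counts, self.stamps)
                   if s is not None and current - s < len(self.counts))

b = Buckets(interval=60, nbuckets=6)
for _ in range(3):
    b.hit(now=100.0)
b.hit(now=130.0)
print(b.count(now=130.0))  # 4 hits inside the 60-second window
```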
So, it costs time to create the list of item keys, fetch each of them, and then sum them, as you do. Data::Throttler only has to sum a contiguous range (i.e. one loop) over its internal buckets array, so it pays just a third of that cost already. If CHI had a method to apply an operation (a sum) over a list of keys, or better still over a contiguous range of items, you could cut your costs a bit. But then that's not something one would expect from cache storage!
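The difference between the two approaches can be sketched like this (again in Python; the function names and the plain dict standing in for CHI are mine, not anyone's real API):

```python
# CHI-style: three passes -- build the key list, fetch each entry, sum.
def throttler_count_chi_style(cache, now, nbuckets, resolution):
    keys = [int(now // resolution) - i for i in range(nbuckets)]  # pass 1: build keys
    values = [cache.get(k, 0) for k in keys]                      # pass 2: n hash lookups
    return sum(values)                                            # pass 3: sum

# Data::Throttler-style: a single loop over a contiguous array slice.
def throttler_count_array_style(buckets, lo, hi):
    total = 0
    for i in range(lo, hi):      # one pass, sequential memory access
        total += buckets[i]
    return total

cache = {13: 1, 10: 3}           # time-slice -> count, standing in for the CHI store
a = throttler_count_chi_style(cache, now=130, nbuckets=6, resolution=10)
buckets = [0, 1, 0, 0, 3, 0]     # the same counts laid out as an array
b = throttler_count_array_style(buckets, 0, 6)
print(a, b)                      # both report 4 hits
```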
The "really bad" news is that Data::Throttler can optimise its internal logic even more. And also add more features. Can you keep up with that by using a generic tool?
Having said all that, I can think of one way to improve performance (but then Data::Throttler can also do that and be ahead again). Instead of having each bucket store the count of hits over its own time-interval, make it keep an incremental count: each bucket holds the sum of its own count plus the counts of all its previous (still-live) buckets. This way you do not need to do a sum each time try_push() is called; the sum is ready and waiting for you in the newest bucket.

However, that sum becomes invalid once a bucket gets too old and has to be removed: whenever that happens, you have to subtract its contribution from the total. So, create a function total_count() which either re-calculates the count if buckets were invalidated since the last call, or else returns the cached count. Re-calculating is really easy: subtract the expired bucket's cumulative count from all surviving buckets. Then, when a hit is added, you only need to add it to the newest bucket. The total count is held by the newest bucket and only needs re-calculating if you ask for it more than one time-interval after the last call. So, here is finally some use for a cache :)
bw, bliako