RFC: Statistics::KernelEstimation - Kernel Density Estimates and Histograms

I would like to invite comments on a new module, named Statistics::KernelEstimation.

This module calculates Kernel Density Estimates and related quantities for a collection of random points.

A Kernel Density Estimate (KDE) is similar to a histogram, but improves on two known problems of histograms: it is smooth (whereas a histogram is ragged), and it does not suffer from ambiguity regarding the placement of bins.
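
The bin-placement ambiguity is easy to reproduce. The following Python sketch is an illustration only, not code from the module; the function name `histogram_counts` and the sample data are made up for this example. It bins the same four points with the same bin width but two different bin origins:

```python
import math

def histogram_counts(data, width, origin):
    """Count points per bin; bin k covers [origin + k*width, origin + (k+1)*width)."""
    counts = {}
    for x in data:
        k = math.floor((x - origin) / width)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))

data = [0.9, 1.1, 1.9, 2.1]  # hypothetical sample
print(histogram_counts(data, 1.0, 0.0))  # bins start at 0.0
print(histogram_counts(data, 1.0, 0.5))  # same width, bins start at 0.5
```

The first call produces a peak in the middle bin, the second a flat two-bin shape, although the data are identical.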

In a KDE, a smooth, strongly peaked function is placed at the location of each point in the collection, and the contributions from all points are summed. The resulting function is a smooth approximation to the probability density from which the set of points was drawn.
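
As a minimal sketch of that construction (in Python, independent of the module's Perl implementation; the names `gaussian_kernel` and `kde_pdf` and the sample data are assumptions for illustration), a Gaussian bump is centred on each point and the bumps are averaged:

```python
import math

def gaussian_kernel(u):
    """Standard normal density: smooth, peaked at 0, integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_pdf(x, data, bandwidth):
    """Sum one kernel per data point, scaled so the estimate integrates to 1."""
    if not data or bandwidth <= 0:
        raise ValueError("need at least one point and a positive bandwidth")
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)

data = [1.0, 1.5, 2.0, 4.0]  # hypothetical sample
for x in (0.0, 1.5, 4.0):
    print(x, kde_pdf(x, data, 0.5))
```

The bandwidth controls how wide each bump is: a small value gives a spiky estimate, a large value smooths away detail.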

This module calculates KDEs as well as Cumulative Distribution Functions (CDF). Three different kernels are available (Gaussian, Box, Epanechnikov).
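
For orientation, here is a Python sketch of the three kernel shapes in their common textbook form, together with the integral of each, which is what a CDF estimate needs. The scaling (support on [-1, 1] for the box and Epanechnikov kernels) and all names are assumptions for illustration; the module may parameterize its kernels differently.

```python
import math

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def gauss_int(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def box(u):
    return 0.5 if abs(u) <= 1.0 else 0.0

def box_int(u):
    return min(1.0, max(0.0, 0.5 * (u + 1.0)))

def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def epanechnikov_int(u):
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3

# Each entry pairs a kernel density with its integral from -infinity to u.
KERNELS = {
    "gauss": (gauss, gauss_int),
    "box": (box, box_int),
    "epanechnikov": (epanechnikov, epanechnikov_int),
}

def kde_cdf(x, data, bandwidth, kernel="gauss"):
    """CDF estimate: average of the integrated kernels centred on each point."""
    _, integral = KERNELS[kernel]
    return sum(integral((x - xi) / bandwidth) for xi in data) / len(data)
```

Because the CDF estimate uses the closed-form integral of the kernel, no numerical integration of the density is required.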

The module also includes limited support for bandwidth optimization.
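
To give a feel for what a bandwidth choice involves, the sketch below uses the widely quoted normal-reference rule of thumb for a Gaussian kernel, h = 1.06 * sigma * n^(-1/5). This is a generic heuristic written in Python for illustration; it is not claimed to be the method the module uses, and the function name is hypothetical.

```python
import statistics

def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth for a Gaussian kernel (a generic heuristic)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two points to estimate the spread")
    sigma = statistics.stdev(data)  # sample standard deviation
    return 1.06 * sigma * n ** (-1.0 / 5.0)

print(rule_of_thumb_bandwidth([1, 2, 3, 4, 5]))
```

The rule works well when the data are roughly normal and tends to oversmooth multi-modal data, which is one reason more careful bandwidth optimization is useful.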

Finally, the module can generate "classical" histograms and distribution functions.

The full POD is available here:

Documentation for Statistics::KernelEstimation

Let me know what you think!

Re: RFC: Statistics::KernelEstimation - Kernel Density Estimates and Histograms by moritz (Cardinal) on Nov 23, 2008 at 22:10 UTC
A few random thoughts (note that I don't know very much about KDE, so some of this might be way off):
How hard would it be to add a user-defined kernel, either through a callback or by passing an array(ref) that contains a discretized version of the kernel? That would make it even more useful.
From a quick glance over the docs, it seems that there's no easy way to calculate the `pdf` and `cdf` functions for all useful positions; the user always has to iterate over the interesting positions. Maybe there could be a function to calculate them all in a given range, and return the values as an array?
You could think about a nice PDL integration.
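
A minimal sketch of the kind of range helper suggested here, written in Python rather than against the module's API: `evaluate_on_grid` is a hypothetical name, and `func` stands in for whatever single-point `pdf` or `cdf` call the caller already has.

```python
def evaluate_on_grid(func, lo, hi, steps):
    """Evaluate func at steps + 1 evenly spaced positions from lo to hi.

    Returns two parallel lists: the positions and the function values.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    width = (hi - lo) / steps
    xs = [lo + i * width for i in range(steps + 1)]
    return xs, [func(x) for x in xs]

# Stand-in for a single-point density call (hypothetical).
xs, ys = evaluate_on_grid(lambda x: 2.0 * x, 0.0, 1.0, 4)
print(list(zip(xs, ys)))
```

The caller still chooses the range and the number of steps, so the cost of evaluation stays under their control.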
Re^2: RFC: Statistics::KernelEstimation - Kernel Density Estimates and Histograms by janert (Sexton) on Nov 23, 2008 at 23:12 UTC
Many thanks for your comments. Let me address them:
User-defined Kernel: I debated that with myself. It would be really easy, since the choice of kernel function is implemented in terms of refs to functions, anyway. However, the protocol that the user-supplied kernel function has to adhere to is a bit larger than one thinks (it's not just the interface, but it also has to be normalized, and the user has to supply its integral as well for use with the CDF, and possibly the 2nd derivative, for use with the bandwidth optimization). What is more, the choice of kernel function is not really that critical - all kernels give more or less the same results. And the two most useful and most popular ones are the Gaussian and the Epanechnikov kernel, which are included. So, with those considerations, it seemed as if allowing for user-defined kernel functions leads to considerable added complexity, but not enough added benefit. Therefore I decided against it. (And if somebody really needs an additional kernel, they can always derive their own subclass from this module, providing the new kernel in the implementation!)
Interesting Points as Array: In principle I like the idea, but the problem is the definition of "interesting". That really depends on what the user wants to do with the data! Also, evaluating either PDF or CDF is expensive, therefore I wanted to leave it to the user to determine the step-width for the iteration (if you don't need precision, you get it faster!).
Integration with PDL: That's an interesting idea. I need to look into that.
Again, good comments. Thanks a lot! I hope my replies make sense.
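
To illustrate why the protocol for a user-supplied kernel is larger than it first appears, here is a Python sketch (not part of the module; `check_kernel` and the triangular kernel are hypothetical examples) that checks two of the requirements named above: the kernel must integrate to 1, and the supplied integral must agree with it. The second derivative would need a similar check and is left out here.

```python
def check_kernel(density, integral, lo, hi, steps=2000):
    """Integrate density over [lo, hi] with the midpoint rule and compare
    the running total with the user-supplied integral at each step.

    Returns (total_area, largest_absolute_deviation).
    """
    width = (hi - lo) / steps
    area = 0.0
    worst = 0.0
    for i in range(steps):
        area += density(lo + (i + 0.5) * width) * width
        worst = max(worst, abs(area - integral(lo + (i + 1) * width)))
    return area, worst

def triangle(u):
    """Triangular kernel on [-1, 1] (example of a user-defined kernel)."""
    return max(0.0, 1.0 - abs(u))

def triangle_int(u):
    """Integral of the triangular kernel from -infinity to u."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 * (1.0 + u) ** 2 if u < 0.0 else 1.0 - 0.5 * (1.0 - u) ** 2

area, worst = check_kernel(triangle, triangle_int, -1.0, 1.0)
print(area, worst)
```

A kernel that fails either check would silently produce densities that do not integrate to 1, or a CDF that is inconsistent with the PDF.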
Re: RFC: Statistics::KernelEstimation - Kernel Density Estimates and Histograms by etj (Deacon) on Jun 06, 2022 at 00:24 UTC
A link to the MetaCPAN page: Statistics::KernelEstimation. A quick scan of the source code suggests it wouldn't be insanely hard to make a PDL version, mainly by replacing the `for` loops with array-programming constructs. (It'd be shorter, too.)

