intelligent re-sampling of data

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

In this node I complained about Excel's ancient limitations that make charting large data sets with it impossible.

Various solutions were proposed, but I was looking for something quick and as non-drastic as possible. Moving away from Excel is not an option, at the moment, for various reasons.

So I decided to resample the data. Given N data points with N > MAX_LIMIT (about 32k for Excel charts), I have to resample the data so that it can be displayed in fewer than MAX_LIMIT points.

Now, naive sampling doesn't work here, because it can easily lose interesting information. Suppose all the data is 0 except for a single 1 somewhere: I want to see that 1, and with a trivial resampling algorithm the chance of seeing it is small.

The solution I've currently settled on is a "maximum filter". I divide the data into "windows" and take one "representative" value from each window: the maximal data value.
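A minimal sketch of that maximum filter, in Python for brevity (the same logic ports directly to Perl). The function names `max_filter` and `naive_sample` and the sample data are made up for illustration; the window size is simply chosen so that the number of windows never exceeds the limit.

```python
def max_filter(data, max_points):
    """Reduce data to at most max_points values, keeping each window's maximum."""
    n = len(data)
    if n <= max_points:
        return list(data)
    # Ceiling division: smallest window size that yields at most max_points windows.
    window = -(-n // max_points)
    return [max(data[i:i + window]) for i in range(0, n, window)]


def naive_sample(data, max_points):
    """Keep every k-th point; shown only for comparison with max_filter."""
    n = len(data)
    if n <= max_points:
        return list(data)
    step = -(-n // max_points)
    return list(data[::step])


# Hypothetical data: all zeros with a single spike.
data = [0] * 100_000
data[54_321] = 1

reduced = max_filter(data, 32_000)
sampled = naive_sample(data, 32_000)
```

With this particular input, `reduced` still contains the spike, while `sampled` happens to step over it.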

This works very nicely since we don't have negative data, but I wonder what the more general algorithms for arbitrary data are?
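One common generalization for signed data (not something proposed in this thread, so treat it as an assumption) is to keep both the minimum and the maximum of each window, in their original order, so that spikes in either direction survive. A hedged Python sketch, with the hypothetical name `min_max_filter` and assuming `max_points` is at least 2:

```python
def min_max_filter(data, max_points):
    """Reduce data to at most max_points values, keeping each window's min and max."""
    n = len(data)
    if n <= max_points:
        return list(data)
    # Each window may emit two values, so only max_points // 2 windows are allowed.
    windows = max(1, max_points // 2)
    size = -(-n // windows)  # ceiling division
    out = []
    for start in range(0, n, size):
        chunk = data[start:start + size]
        lo = min(range(len(chunk)), key=chunk.__getitem__)
        hi = max(range(len(chunk)), key=chunk.__getitem__)
        # Emit in original order so the plotted line keeps its shape;
        # a flat window (lo == hi) contributes a single value.
        for i in sorted({lo, hi}):
            out.append(chunk[i])
    return out


# Hypothetical data: flat signal with one positive and one negative spike.
signal = [0] * 1000
signal[100] = 5
signal[900] = -7

envelope = min_max_filter(signal, 100)
```

A plain maximum filter would drop the negative spike here; the min/max envelope keeps both.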

After all, Excel and other plotters use it. Even if I plot only 32000 points, there are still only ~1000 pixels on my screen, so Excel does some resampling of its own.


Replies are listed 'Best First'.
Re: intelligent re-sampling of data by BrowserUk (Patriarch) on Jun 27, 2005 at 11:42 UTC
"Even if I plot only 32000 points, there are still only ~1000 pixels on my screen, so Excel does some resampling of its own."

I doubt that Excel is doing any resampling of the data. It is probably plotting all the points you supply, even if that means many pixels are "redrawn" many times as a result. When you shrink or stretch the window in which the chart is drawn, Excel does not go back and resample the data; it simply applies a different final matrix transformation (SetWorldTransform) to the lines and points that make up the chart in order to scale them to the device space allotted.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: intelligent re-sampling of data by Ido (Hermit) on Jun 27, 2005 at 11:06 UTC
I don't know if there's a general algorithm, but you saying that Excel should also do the same resampling made me think "So GD::Graph should as well". Only, it's far easier to see how GD::Graph does things. So maybe you'd like to take a look at the sub "val_to_pixel" in the source of GD::Graph::XYlines... HTH..
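To illustrate the general idea of mapping values to pixels, here is a hedged Python sketch of a plain linear mapping. It is not GD::Graph's actual code; `index_to_pixel` and the 32000-point, 1000-pixel figures are assumptions taken from the numbers in the question. It shows that a plotter needs no explicit resampling step: many consecutive points simply land on the same pixel column.

```python
from collections import Counter


def index_to_pixel(i, n_points, width_px):
    """Linearly map sample index i in [0, n_points) to a pixel column in [0, width_px)."""
    return i * width_px // n_points


n_points, width_px = 32_000, 1_000

# Count how many data points fall on each pixel column.
columns = Counter(index_to_pixel(i, n_points, width_px) for i in range(n_points))
points_per_column = set(columns.values())
```

With these numbers every one of the 1000 columns receives 32 points, so whichever point is drawn last (or whichever extreme is drawn furthest) decides what the eye sees.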
Re: intelligent re-sampling of data by mattr (Curate) on Jun 28, 2005 at 07:00 UTC
If you're asking about Perl, this and the above post on GD are relevant. Get the data out of Excel and reprocess it yourself. Then either make a new Excel sheet with condensed data, or draw your own chart in GD. I have had good results parsing data from spreadsheets of up to 1MB (the parsing takes a little while at that size). I am plotting just a certain window in time, though, as bar charts with plain vanilla GD.

You could do lots of processing, but your eye is really good at picking things out too. So instead of figuring out how to reduce the amount of data, I would suggest that you simply plot all of the data, perhaps with an appropriate transformation (stretch) to emphasize what you are looking for. If it is just 1 and 0, well, that's not so bad. But if it is fine differences, then try to expand the spectrum of color you use too.

Once you have a finely drawn chart, explore it with a program that lets you change the gamut and zoom in if you like; there may be a lot of detail you can't see easily that will be uncovered by changing the palette. For example, xephem may show you what kinds of things are possible (I have the comet on my mind). It can find dim stars, and invert or add tails to make them easier to see. Or the GIMP, though I have in mind an image processing program called NIH Image (or NIH ImageJ for Windows/Linux/OS X). It's not just for biotech, and it has an astronomy file as a sample which may be applicable to your problem. It has some interesting features, although I don't think it lets you drag and adjust the lookup table (LUT) in real time anymore like Photoshop does. Something that lets you do that, and maybe zoom or view from different angles (if you have a 3D box, for example), can help you find patterns extremely quickly. Possibly an OpenGL model of it could be interesting too (I'm thinking of Wx::GLCanvas, though searching search.cpan.org for "opengl" gave some likely suspects).

There are undoubtedly a lot of other stats programs out there that could be used, like the R statistics language with the IDPmisc library package for display of large datasets (it adds effects to the display). And there is IBM's OpenDX (Open Visualization Data Explorer, which is open source). The Colormap Editor in the screenshot on that page is what I was talking about. Anyway, it is a very cool system, only a 40 megabyte download away!