Re: How to Determine Stable Y-Values... or Detect an Edge..?
by tilly (Archbishop) on Aug 19, 2009 at 07:03 UTC
If you're willing to learn a lot of theory, you could look up the wavelet packet transform, which has the very nice property of being very good at finding boundaries in noisy data. Typical uses include picking phonemes out of a sound wave, filtering out white noise, and image recognition software.
It would therefore solve this problem nicely, but learning enough theory to properly understand and implement this solution may take you a considerable amount of time. (Think multiple weeks.) Sorry, but I don't know the theory. I did learn it at one point, but I forgot it somewhere in the last millennium.
If you're looking for easier heuristics I'd suggest something like this. Take the minimum and maximum values on the Y-axis. Suppose that range is 60-140. Divide the range into 20 pieces. (20 is arbitrary, try it with several other numbers until you get satisfactory results.) So you'd have intervals 60-64, 64-68, 68-72 and so on. Now make (overlapping) intervals of adjacent pairs. So you'd have 60-68, 64-72, 68-76 and so on. For each of those slices on the Y-axis there are one or more maximal intervals on the X-axis where you stay inside of the range. Compute all of those.
You now have a ton of overlapping intervals on the X-axis. Take out the longest interval you've got. That's a stable area. Then cut out the longest interval in what is left. And the largest that is left in that. And so on.
Then what you do is call all intervals above a certain length stable, and you know their end points.
(The reason for making them overlapping is that if you have a stable interval varying around, say, 68, you want to notice it even if 68 is one of the boundaries of a vertical slice you're looking at.)
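A minimal Perl sketch of the heuristic above. The data, the choice of 20 slices, and the minimum stable length are all arbitrary placeholders, not anything from the OP's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical step data: a noisy plateau near 70, then one near 130
my @y = ( ( map { 70  + ( $_ % 5 ) - 2 } 1 .. 40 ),
          ( map { 130 + ( $_ % 5 ) - 2 } 1 .. 40 ) );

my ( $lo, $hi ) = ( min(@y), max(@y) );
my $pieces = 20;                          # arbitrary; tune to taste
my $step   = ( $hi - $lo ) / $pieces;

# For each overlapping pair of vertical slices, collect the maximal
# X-axis runs whose Y values stay inside the slice pair
my @intervals;
for my $i ( 0 .. $pieces - 2 ) {
    my ( $a, $b ) = ( $lo + $i * $step, $lo + ( $i + 2 ) * $step );
    my $start;
    for my $x ( 0 .. $#y ) {
        if ( $y[$x] >= $a && $y[$x] <= $b ) {
            $start = $x unless defined $start;
        }
        elsif ( defined $start ) {
            push @intervals, [ $start, $x - 1 ];
            undef $start;
        }
    }
    push @intervals, [ $start, $#y ] if defined $start;
}

# Repeatedly take the longest interval that doesn't overlap one
# already chosen; call it stable if it is long enough
my @stable;
for my $iv ( sort { ( $b->[1] - $b->[0] ) <=> ( $a->[1] - $a->[0] ) } @intervals ) {
    next if grep { $iv->[0] <= $_->[1] && $iv->[1] >= $_->[0] } @stable;
    push @stable, $iv if $iv->[1] - $iv->[0] >= 10;    # min length, arbitrary
}
printf "stable: x=%d..%d\n", @$_ for sort { $a->[0] <=> $b->[0] } @stable;
```

With this made-up data it reports the two plateaus, x=0..39 and x=40..79.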
You'll have to play around with how many pieces to cut the vertical range into, and how long a horizontal piece must be to be called stable. It won't be pretty code and you will need some experimentation, but you will probably get somewhat reasonable results. And you won't have to learn a ton of math to do it properly.
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by salva (Canon) on Aug 19, 2009 at 07:31 UTC
You can apply a low-pass filter to the data first, in order to remove the "noise".
Probably there is already some module on CPAN that does it (look for Fourier transform or series). Otherwise, programming it from scratch should not be too difficult once you understand the maths behind it ;-)
Once you have your data cleaned, looking for stability intervals would just be a matter of defining how to measure stability. For instance, you can set a maximum Y delta and search for intervals where delta_X > min_delta_X and delta_Y < max_delta_Y.
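A sketch of that idea, assuming a crude centred moving average as the low-pass filter and made-up step data; the window size, min_delta_X, and max_delta_Y values are arbitrary:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum min max);

my @y = ( (70) x 30, (130) x 30 );    # hypothetical step data

# Crude low-pass filter: centred moving average over 5 samples
my $w = 2;
my @smooth = map {
    my ( $a, $b ) = ( max( 0, $_ - $w ), min( $#y, $_ + $w ) );
    sum( @y[ $a .. $b ] ) / ( $b - $a + 1 );
} 0 .. $#y;

# A stable interval: delta_X >= $min_dx while the Y spread stays <= $max_dy
my ( $min_dx, $max_dy ) = ( 10, 5 );
my @stable;
my $start = 0;
for my $x ( 1 .. $#smooth + 1 ) {
    my @run = @smooth[ $start .. min( $x, $#smooth ) ];
    next if $x <= $#smooth && max(@run) - min(@run) <= $max_dy;
    push @stable, [ $start, $x - 1 ] if $x - 1 - $start >= $min_dx;
    $start = $x;
}
printf "stable: x=%d..%d\n", @$_ for @stable;
```

On this data it reports x=0..27 and x=32..59; the smoothed ramp around the jump is excluded from both intervals.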
In any case, make your data samples available to us so we can also play with them!
I like the idea of a low pass filter, though that is not sufficient to solve this problem. But with a data set of this kind, a Fourier transform is a particularly poor choice. The reason is that the Fourier transform tries to approximate the data as a set of sine waves. Therefore if you have a jump anywhere in the data, then applying a Fourier transform, removing high order terms, and transforming back will leave you with severe visible "ringing" over the whole data set, which will obscure what he's looking for.
That's why I naturally jumped to wavelets, which localize the higher order terms so that removing noise in one part of the picture doesn't create it elsewhere. They do that by combining low-pass and high-pass filters to separate data into localized high frequency pieces, and spread out low frequency pieces.
The easiest kind of wavelet to use is the Haar wavelet, which is likely also most appropriate for this case. The Haar wavelet is just a bunch of square waves. If the number of points is a power of 2, then you'd calculate the Haar wavelet as follows. Take each pair of points. Calculate their average and their difference from the average. The differences are the highest frequency wavelets, so save them. You have half as many averages as you have points. Repeat the calculation on the averages to get the next smaller wavelets. Then the next smaller wavelets, etc. Finally you'll be left with an overall average.
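The procedure just described, as a short Perl sketch; the eight-point sample is made up to show the key property, that a lone large coefficient marks the jump:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Full Haar decomposition; the input length must be a power of 2
sub haar {
    my @avg = @_;
    my @details;
    while ( @avg > 1 ) {
        my ( @a, @d );
        for ( my $i = 0 ; $i < @avg ; $i += 2 ) {
            push @a, ( $avg[$i] + $avg[ $i + 1 ] ) / 2;   # pairwise average
            push @d, ( $avg[$i] - $avg[ $i + 1 ] ) / 2;   # difference from average
        }
        unshift @details, [@d];    # coarser levels end up at the front
        @avg = @a;
    }
    return ( $avg[0], @details );  # overall average, then per-level details
}

my @y = ( 4, 4, 4, 4, 8, 8, 8, 8 );    # two flat chunks with one jump
my ( $mean, @levels ) = haar(@y);
print "mean = $mean\n";                            # 6
print "level $_: @{ $levels[$_] }\n" for 0 .. $#levels;
# The only nonzero coefficient (-2, at the coarsest level) sits at the
# jump; the all-zero finer levels say the data is locally flat everywhere
```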
This procedure works, and will work well with data sets with large piecewise horizontal chunks. However, as described it will not find boundaries well when they do not fall conveniently on a power of 2, nor will it work well if the width of the images is not a power of 2.
Wavelet packet algorithms generalize this procedure by dynamically altering the details of wavelet shape to come up with a solution that optimizes some global goal. For instance you could vary the boundary of the square waves to concentrate as much energy as possible into the smallest number of waves possible. If you do that then the boundaries of the largest few wavelets will naturally fall on the intervals that they're trying to find, and the intervals will be recognizable by the interior pieces being very small.
I remember that much of what they do, but I don't remember how to dynamically calculate wavelet packets. :-(
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by BrowserUk (Patriarch) on Aug 19, 2009 at 13:54 UTC
I think a fairly straight-forward (modified) moving-averages algorithm might suit your needs. The modification is to round the moving average to the nearest transition level. For example, if your transition levels are every 30 units, as somewhat suggested by your description (but not so much by the graphs), then you calculate the average of the last N values and round it to the nearest transition level.
This code demonstrates the idea (using simulated data):
#! perl -slw
use strict;
use GD::Graph::lines;
use GD::Graph::colour qw(:colours :lists :files :convert);
use Data::Dump qw[ pp ];
use List::Util qw[ sum ];
our $MOVES ||= 4;
our $SRAND ||= 1;
srand $SRAND; ## Allow consistency between runs for testing
my $y = 30 * int rand 6; ## A base level for the random data
my @data = map[], 1..4;
my @moving; ## to hold the last $MOVING values for averaging
## Every 3 minutes throughout a day
for my $x ( 0 .. (24 * 20 -1 ) ) {
## Generate a Y value that varies randomly around the current
## transition level (which itself changes randomly; see below)
my $actualY = $y + -30 + int( rand 60 );
push @{ $data[ 0 ] }, $x / 20; ## X values in decimal hours
push @{ $data[ 1 ] }, $y; ## display the random baseline in red
push @{ $data[ 2 ] }, $actualY; ## The actual values in green
## Store the LAST $MOVES values
push @moving, $actualY;
shift @moving if $#moving >= $MOVES ;
## Calculate the moving average
my $ave = sum( @moving ) / @moving; ## average over the values banked so far
## and round it to the nearest transition level
$ave = 30 * ( int( ( $ave + 15 ) / 30 ) );
push @{ $data[ 3 ] }, $ave; ## display it in blue
## Make a random change to the base level 20% of the time
next if rand > 0.2;
$y = ( int( $y / 30 ) + ( -1 + int rand 3 ) ) * 30;
$y = 30 if $y < 30;
$y = 150 if $y > 150;
}
#pp \@data; <>;
my $file = '789655.png';
my $graph = GD::Graph::lines->new(3000, 768);
$graph->set(
'bgclr' => 'white',
'transparent' => 0,
'interlaced' => 1,
title => 'Some simple graph',
x_label => 'X Label',
x_max_value => 24,
x_min_value => 0,
x_tick_number => 24,
y_label => 'Y label',
y_max_value => 180,
y_min_value => 0,
y_tick_number => 12,
y_label_skip => 2,
) or die $graph->error;
my $gd = $graph->plot(\@data) or die $graph->error;
open IMG, '>', $file or die $!;
binmode IMG;
print IMG $gd->png;
close IMG;
system $file; ## Load the graph into the local default image viewer
__END__
Usage: 789655.pl -MOVES=5 ##(less than 3 or > 10 not good )
When you run this, you'll see a different graph from mine (different PRNG), but you should see a green line hopping wildly either side of a step-wise varying red line. These are the simulated data and actual random transition levels respectively.
The blue line--which should be tracking the red line fairly closely (though with lag)--is the rounded moving average calculated from the green data without reference to the red.
The higher you set -MOVES, the less likely you are to detect false edges, but the greater the lag before you detect real ones. 4 to 8 seems to work well, depending upon your priorities. If you are doing this statically--ie. when you have all the data to hand--you can easily correct for the lag with a simple -X offset. If you are doing it in real time as the data is gathered, the lag could be a problem depending upon your responsiveness requirements.
There are some other interesting variations upon the above that might be applicable depending on the nature of the data and the use of the calculated values, but describing them all would be pointless. Maybe you can share that information with us?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
push @moving, $val;
next if $#moving < ($MOVES / 2);
shift @moving if @moving >= $MOVES;
a) Can you give me a quick explanation of what the middle line is doing? I'm not familiar with the '$#' construct.
b) You also have the following in your code:
my @rounds = $EXTRA ? ( 44, 90, 100, 110, 120 ) : ( 44, 90, 120 );
It would certainly simplify my job if I knew and could specify these values (that is, 44, 90 and 120)... but in many cases, I don't KNOW those values. Is there a means to simply determine those values from the data... or should I simply create 10 'bins' for example (10, 20, 30... 110, 120) and gradually refine them until I get the 3 'real' values?
I hope these questions are clear enough. It's just that although I've been playing about with perl for 15 years or so, there are still a LOT of basic constructs and things that I don't know about...!
Thanks again
Can you give me a quick explanation of what the middle line is doing?
$#moving is the index of the last element in the array--ie. one less than the total number of elements, as indexes start at zero.
The statement next if $#moving < ($MOVES / 2); says: skip the moving-average calculation until we have at least half as many values in the bank as the period of the moving average. I.e., if we are calculating the moving average over 50 values, don't start until we have at least 25 accumulated.
The effect of the line is to correct for the lag normally associated with moving averages, effectively adding a -X(P/2) correction to the moving-average trace by discarding the first P/2 moving-average values, which would be calculated from fewer than P/2 inputs and so be dubious anyway.
I don't KNOW those values. Is there a means to simply determine those values from the data...
First, it was not at all clear in your OP that you did not know these transitions. Quite the reverse, actually. The entire emphasis of your post was how to determine the points of transition between (predetermined) "steady state" levels, not determining those levels. Sorry if I misread your intent.
To answer the question, can they be determined from the data, I'll come back to the question from my earlier post. Are you doing this statically--ie. with all the data known up front--or dynamically--as the data is accumulated?
And add a second question: How will you assess whether the levels you choose--whether by inspection or calculation--are the correct ones?
My point being that it is pretty clear if you inspect the frequency analysis of the values in your sample data:
C:\test>junk9
44 : 481 (43.18%)
90 : 176 (15.80%)
120 : 127 (11.40%)
100 : 56 (5.03%)
102 : 24 (2.15%)
101 : 21 (1.89%)
104 : 19 (1.71%)
99 : 17 (1.53%)
115 : 14 (1.26%)
98 : 13 (1.17%)
92 : 12 (1.08%)
107 : 12 (1.08%)
106 : 12 (1.08%)
97 : 12 (1.08%)
108 : 11 (0.99%)
95 : 10 (0.90%)
112 : 10 (0.90%)
105 : 10 (0.90%)
117 : 9 (0.81%)
118 : 8 (0.72%)
110 : 8 (0.72%)
109 : 8 (0.72%)
96 : 7 (0.63%)
111 : 7 (0.63%)
114 : 6 (0.54%)
113 : 5 (0.45%)
93 : 5 (0.45%)
103 : 4 (0.36%)
119 : 3 (0.27%)
91 : 3 (0.27%)
94 : 3 (0.27%)
116 : 1 (0.09%)
that if you want(need, desire, see) 3 transition levels, then 44, 90 & 120 are the ones to pick. If you want or need 4 values it is also fairly clear. But what about 5? Do you pick 101 or 102?
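A frequency analysis like the one above takes only a few lines of Perl. The counts here are stand-ins for real data, which would be read from the CSV:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a day's Y values; in practice, read them from the CSV
my @y = ( (44) x 481, (90) x 176, (120) x 127, (100) x 56 );

# Count occurrences of each value and list them by frequency
my %freq;
$freq{$_}++ for @y;
printf "%3d : %4d (%5.2f%%)\n", $_, $freq{$_}, 100 * $freq{$_} / @y
    for sort { $freq{$b} <=> $freq{$a} } keys %freq;

# One possible policy: any value over 10% of the samples is a level
my @levels = sort { $a <=> $b } grep { $freq{$_} / @y > 0.10 } keys %freq;
print "levels: @levels\n";    # levels: 44 90 120
```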
If you calculate these values somehow--say, using wavelets, or short-time Fourier transforms--then you are quite likely to get numbers resembling: 44.8972321, 91.00002, 102.87563, 119.0300657, or similar. That is, the calculations are likely to produce output values that don't actually appear in your input data. And if you attempt to round or truncate them to values that do appear, then you will likely end up with transition level values that are wholly unrepresentative of the inputs in terms of frequency--because that is the nature of the math.
So the question becomes: What are you going to do with these values? What use are you intending to make of them? How critical is the outcome? Will lives depend upon it? Or just the position of a line on a presentation graph? Without some clues as to the purpose, it is difficult to make suggestions!
Several possibilities come to mind that might be appropriate to some purposes.
- You might decide (arbitrarily) that you want 3(N) levels.
In which case, sort the inputs by frequency and pick the top 3(N).
- Or you might decide that any input that constitutes more than some (again arbitrary) percentage of the inputs, constitutes a level.
For the sample data, if you decided 10%, then you get the 3 transitions at 44, 90 & 120. If you decided 5%, then you gain an extra level at 100.
- Or you might favour a selection policy based around deviation from the mean.
So any value that falls within 1 standard deviation of the mean constitutes a level.
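The top-N policy from the first bullet is essentially one sort; the counts and N = 3 here are illustrative, not derived from the real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative frequency counts; N = 3 is an arbitrary choice
my %freq = ( 44 => 481, 90 => 176, 120 => 127, 100 => 56, 102 => 24 );
my $n = 3;

# Sort values by descending frequency and keep the top N as levels
my @top    = ( sort { $freq{$b} <=> $freq{$a} } keys %freq )[ 0 .. $n - 1 ];
my @levels = sort { $a <=> $b } @top;
print "levels: @levels\n";    # levels: 44 90 120
```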
With a little more thought, I could come up with half a dozen other possibilities, but which if any is appropriate to your purpose, depends very much upon that purpose.
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by NiJo (Friar) on Aug 19, 2009 at 18:14 UTC
I'd define the stable states T1 and T2 by a moving box with an equilibration period and a delta. When the box is entered on the right side and left on the left side, the graph is retrospectively defined as stable on entering. When unstable, the box is exited at the top or bottom. The delta could be a relative standard deviation for noise, and perhaps also a linear regression for drift.
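A sketch of the moving box using the relative-standard-deviation delta; the period, delta, and step data are all invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Relative standard deviation of a window of samples
sub rel_sd {
    my @w    = @_;
    my $mean = sum(@w) / @w;
    my $var  = sum( map { ( $_ - $mean )**2 } @w ) / @w;
    return sqrt($var) / abs($mean);
}

my @y = ( (100) x 21, (140) x 21 );     # invented step data
my ( $period, $delta ) = ( 10, 0.02 );  # equilibration period and noise delta
my @stable_starts;
for my $i ( 0 .. @y - $period ) {
    # the trace is retrospectively stable over [$i, $i + $period - 1]
    push @stable_starts, $i
        if rel_sd( @y[ $i .. $i + $period - 1 ] ) < $delta;
}
print scalar(@stable_starts), " stable windows; ",
    "boxes straddling the step fail the delta test\n";
```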
The classic criterion for a transition is the second derivative, but that suffers heavily from noise.
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by ozboomer (Friar) on Aug 19, 2009 at 21:26 UTC
Many thanks to everyone for the suggestions so far; as expected, there's a lot of complicated details for me to digest here!
For those interested, I have uploaded some data, should you like to experiment. Have a look at sm_data.csv for a typical day's worth
of data, exactly as it comes from the source. NOTE that the time is hh:mm or similar...
Try C&P'ing your csv file after the __DATA__ tag, and run it with -MOVES=10
Looking at the plot, there seem to be extra stable-ish plateaus at 100 and 110 as well as those at 44, 90 & 120. If you compare with and without those extra round points you'll see what I mean. Use -EXTRA on the command line to use them.
If you need just the 3 levels, the setting -MOVES=50 seems to produce good results.
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by Anonymous Monk on Aug 19, 2009 at 14:03 UTC
From how you are describing your data (sorry, I still don't see pictures), your data is piecewise constant. At any given time step, you are comparing an established constant to a hypothetical new constant of the last N values sampled, the object being that this new constant better explains the last N values than the currently accepted (long-term) constant. Part of the problem is figuring out what N is. As I have described it, this is a spline, and the problem with N is automatic (or variable) knot placement. I have seen some papers on selecting between piecewise-constant and piecewise-linear splines with variable knot placement, but off the top of my head I don't have URLs for you.
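One crude online version of that comparison, with invented data and arbitrary window and threshold values; proper variable-knot placement would refine the knot position:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

my @y = ( (50) x 30, (90) x 30 );    # invented piecewise-constant data
my ( $n, $threshold ) = ( 8, 20 );   # window size and jump size, arbitrary

# Keep an "established" constant; when the mean of the last $n samples
# drifts far from it, declare a knot (an edge) near the window start
my $level = $y[0];
my @knots;
for my $i ( $n .. $#y ) {
    my $recent = sum( @y[ $i - $n + 1 .. $i ] ) / $n;
    if ( abs( $recent - $level ) > $threshold ) {
        push @knots, $i - $n + 1;    # the new segment starts about here
        $level = $recent;
    }
}
print "knots near: @knots\n";
```

On this data it flags a single knot a few samples before the true change at x=30, which is the expected lag of such a windowed test.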
Re: How to Determine Stable Y-Values... or Detect an Edge..?
by ozboomer (Friar) on Aug 24, 2009 at 00:28 UTC
The behaviour that we're looking at here is more-or-less defined by a set of 'black box' algorithms that we know precious little about within the software being used to control the Y-values in the process (this software is pretty old and has been through PDP, VAX and now PC platforms with only minor revisions(!)). There are actually many data inputs that affect the way the Y-values move but as they're in 'black boxes', we're more-or-less not interested in their workings and we're just looking at the output.
We start by defining some 'tuning parameters' for the process to use. For simplicity, let's call them A, B1, B2, C and D. Note that B2 is optional. A jumps directly to B1, B1 directly to B2... but after that, from B1 or B2 to C is a continuous distribution, as is from C to D. In the example data I've posted here, the values are: A=44, B1=90 (B2 is not used), C=100 and D=100 (Oops! C would normally be ~10% less than D but no matter).
When the process starts at the beginning of the day, these values, together with various other inputs are processed and the Y-value is produced. In some approximate way, depending on another few inputs, the next Y-value is calculated based on the previous Y-value, so we have something like: Y1 = f(Y0, A, B1, B2, C, D + others) and X1 = X0 + ~Y1.
When we do the analysis, we generally have a complete day's worth of data available... but in terms of tuning the operation (that is, adjusting the A, B1, B2, etc values), we might like to run the analysis on incomplete data; for example, once we get to 10:00am or so, we might run some analysis on today's data and re-define the A, B1, B2, etc values on the fly... but that's not the primary purpose of what my application is about.
After all this, we're basically looking at the Y-value range A to D (or the highest value achieved in the day OR 'so far') and we're trying to determine the point between the time where the 'rise' started in the morning and where the Y-values stabilized at a 'maximum value' in the morning. Once we determine that time range, we can make further decisions about whether to use the mid-point, the 3rd points or whatever, to say when the real 'load' on the system 'started'.
Once we have this basic algorithm worked out, we'd apply it to the other time period transitions throughout the day. So we'd end up having something like this:
AM Peak:        Start (T1), End (T2)
High Off-Peak:  Start (T3 = T2), End (T4)
PM Peak:        Start (T5 = T4), End (T6)
Out of all this, we're simply trying to determine a relatively constant 'start' and 'end' point for the various peak periods during the day, so that we can analyze maybe some months of data with a consistent 'start' and 'end' time.
We would also like to automate how we decide on the A, B1, B2, etc values in response to what the actual loading on the system is. At present, we make arbitrary decisions on what those values may be and they frequently don't give the performance (stability of Y-value) we expect at various times.
I hope this clarifies what we're trying to do...
Again, many thanks for everyone's input - it's been invaluable.