in reply to How to Determine Stable Y-Values... or Detect an Edge..?

I think a fairly straightforward (modified) moving-average algorithm might suit your needs. The modification is to round the moving average to the nearest transition level. For example, if your transition levels are every 30 units, as somewhat suggested by your description (though not so much by the graphs), then you calculate the average of the last N values and round that to the nearest transition level.
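(To make the rounding step concrete -- a minimal sketch only, assuming $ave holds the current moving average and the levels are multiples of 30 as in the example above:)

my $spacing = 30;   ## assumed distance between transition levels
my $rounded = $spacing * int( ( $ave + $spacing / 2 ) / $spacing );   ## nearest multiple of $spacing, for non-negative $ave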

This code demonstrates the idea (using simulated data):

#! perl -slw
use strict;

use GD::Graph::lines;
use GD::Graph::colour qw(:colours :lists :files :convert);
use Data::Dump qw[ pp ];
use List::Util qw[ sum ];

our $MOVES ||= 4;
our $SRAND ||= 1;

srand $SRAND;               ## Allow consistency between runs for testing

my $y = 30 * int rand 6;    ## A base level for the random data

my @data = map [], 1 .. 4;  ## X values, baseline, raw values, rounded moving average
my @moving;                 ## to hold the last $MOVES values for averaging

## Every 3 minutes throughout a day
for my $x ( 0 .. ( 24 * 20 - 1 ) ) {
    ## Generate Y values that vary randomly around the current transition level
    ## (which itself varies randomly; see below)
    my $actualY = $y + -30 + int( rand 60 );

    push @{ $data[ 0 ] }, $x / 20;      ## X values in decimal hours
    push @{ $data[ 1 ] }, $y;           ## display the random baseline in red
    push @{ $data[ 2 ] }, $actualY;     ## the actual values in green

    ## Store the last $MOVES values
    push @moving, $actualY;
    shift @moving if $#moving >= $MOVES;

    ## Calculate the moving average
    my $ave = sum( @moving ) / $MOVES;

    ## and round it to the nearest transition level
    $ave = 30 * ( int( ( $ave + 15 ) / 30 ) );
    push @{ $data[ 3 ] }, $ave;         ## display it in blue

    ## Make a random change to the base level 20% of the time
    next if rand > 0.2;
    $y = ( int( $y / 30 ) + ( -1 + int rand 3 ) ) * 30;
    $y = 30  if $y < 30;
    $y = 150 if $y > 150;
}

#pp \@data; <>;

my $file = '789655.png';

my $graph = GD::Graph::lines->new( 3000, 768 );

$graph->set(
    'bgclr'       => 'white',
    'transparent' => 0,
    'interlaced'  => 1,
    title         => 'Some simple graph',
    x_label       => 'X Label',
    x_max_value   => 24,
    x_min_value   => 0,
    x_tick_number => 24,
    y_label       => 'Y label',
    y_max_value   => 180,
    y_min_value   => 0,
    y_tick_number => 12,
    y_label_skip  => 2,
) or die $graph->error;

my $gd = $graph->plot( \@data ) or die $graph->error;

open IMG, '>', $file or die $!;
binmode IMG;
print IMG $gd->png;
close IMG;

system $file;   ## Load the graph into the local default image viewer

__END__
Usage: 789655.pl -MOVES=5   ## (less than 3 or > 10 not good)

When you run this you'll see a different graph from the one I get (different PRNG), but you should see a green line hopping wildly either side of a step-wise varying red line. These are the simulated data and the actual (random) transition levels respectively.

The blue line--which should be tracking the red line fairly closely (though with lag)--is the rounded moving average calculated from the green data without reference to the red.

The higher you set -MOVES, the less likely you are to detect false edges, but the greater the lag before you detect the real ones. 4 to 8 seems to work well depending upon your priorities. If you are doing this statically--ie. when you have all the data to hand--you can easily correct for the lag with a simple -X offset. If you are doing it in real time as the data is gathered, the lag could be a problem depending upon your responsiveness requirements.
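In the static case that offset is nothing more than shifting the averaged trace back by half the window. A rough sketch (the names @averaged and @corrected are not from the script above; @averaged is assumed to hold the rounded moving averages and $MOVES the window size):

## Shift the averaged trace left by roughly half the window to undo the lag
my $lag = int( $MOVES / 2 );
my @corrected = @averaged[ $lag .. $#averaged ];   ## drop the first $lag (lagging) points
push @corrected, ( $corrected[ -1 ] ) x $lag;      ## pad the tail so the series keeps its original length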

There are some other interesting variations upon the above that might be applicable depending on the nature of the data and the use of the calculated values, but describing them all would be pointless. Maybe you can share that information with us?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: How to Determine Stable Y-Values... or Detect an Edge..?
by ozboomer (Friar) on Aug 21, 2009 at 06:07 UTC

    Just thinking out loud about this note... and I have a couple of basic questions...

    a) In your code:

    push @moving, $val;
    next if $#moving < ($MOVES / 2);
    shift @moving if @moving >= $MOVES;

    Can you give me a quick explanation of what the middle line is doing? I'm not familiar with the '$#' construct.

    b) You also have the following in your code:

    my @rounds = $EXTRA ? ( 44, 90, 100, 110, 120 ) : ( 44, 90, 120 );

    It would certainly simplify my job if I knew and could specify these values (that is, 44, 90 and 120)... but in many cases, I don't KNOW those values. Is there a means to simply determine those values from the data... or should I simply create 10 'bins' for example (10, 20, 30... 110, 120) and gradually refine them until I get the 3 'real' values?

    I hope these questions are clear enough. It's just that although I've been playing about with perl for 15 years or so, there are still a LOT of basic constructs and things that I don't know about...!

    Thanks again

      1. Can you give me a quick explanation of what the middle line is doing?

        $#moving is the highest numbered element in the array--ie. one less than the total number of elements, as indexes start at zero.

        The statement next if $#moving < ($MOVES / 2); says: skip the moving-average calculation until we have at least half as many values in the bank as the period of the moving average. I.e. if we are calculating the moving average over 50 values, don't start until we have at least 25 accumulated.

        The effect of the line is to correct for the lag that is normally associated with moving averages. Effectively it adds a -X(P/2) correction to the moving-average trace by discarding the first P/2 moving-average values, which would be calculated from fewer than P/2 inputs and so be dubious anyway.
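        A throw-away illustration of the $# construct (not part of the posted script):

          my @moving = ( 10, 20, 30 );
          print $#moving, "\n";        ## 2 : index of the last element
          print scalar @moving, "\n";  ## 3 : number of elements
          ## With $MOVES = 10, "$#moving < ( $MOVES / 2 )" is true (2 < 5),
          ## so the loop would 'next' and skip the averaging for this pass.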

      2. I don't KNOW those values. Is there a means to simply determine those values from the data...

        First: it was not at all clear in your OP that you did not know these transition levels. Quite the reverse, actually. The entire emphasis of your post was on how to determine the points of transition between (predetermined) "steady state" levels, not on determining those levels themselves. Sorry if I misread your intent.

        To answer the question, can they be determined from the data, I'll come back to the question from my earlier post. Are you doing this statically--ie. with all the data known up front--or dynamically--as the data is accumulated?

        And add a second question: how will you assess whether the levels you choose--whether by inspection or calculation--are the correct ones?

        My point being that the answer is pretty clear if you inspect the frequency analysis of the values in your sample data:

        C:\test>junk9
        44 : 481 (43.18%)
        90 : 176 (15.80%)
        120 : 127 (11.40%)
        100 : 56 (5.03%)
        102 : 24 (2.15%)
        101 : 21 (1.89%)
        104 : 19 (1.71%)
        99 : 17 (1.53%)
        115 : 14 (1.26%)
        98 : 13 (1.17%)
        92 : 12 (1.08%)
        107 : 12 (1.08%)
        106 : 12 (1.08%)
        97 : 12 (1.08%)
        108 : 11 (0.99%)
        95 : 10 (0.90%)
        112 : 10 (0.90%)
        105 : 10 (0.90%)
        117 : 9 (0.81%)
        118 : 8 (0.72%)
        110 : 8 (0.72%)
        109 : 8 (0.72%)
        96 : 7 (0.63%)
        111 : 7 (0.63%)
        114 : 6 (0.54%)
        113 : 5 (0.45%)
        93 : 5 (0.45%)
        103 : 4 (0.36%)
        119 : 3 (0.27%)
        91 : 3 (0.27%)
        94 : 3 (0.27%)
        116 : 1 (0.09%)
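        (The junk9 script itself isn't shown here; a tally like this takes only a few lines -- a sketch, assuming the readings are in @values:)

          my %freq;
          $freq{ $_ }++ for @values;    ## count each distinct reading
          my $total = @values;
          for my $v ( sort { $freq{ $b } <=> $freq{ $a } } keys %freq ) {
              printf "%3d : %4d (%.2f%%)\n", $v, $freq{ $v }, 100 * $freq{ $v } / $total;
          }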

        If you want (need, desire, see) 3 transition levels, then 44, 90 & 120 are clearly the ones to pick. If you want or need 4 values, it is also fairly clear. But what about 5? Do you pick 101 or 102?

        If you calculate these values somehow--say, using wavelets, or short-time Fourier transforms--then you are quite likely to get numbers resembling: 44.8972321, 91.00002, 102.87563, 119.0300657, or similar. That is, the calculations are likely to produce output values that don't actually appear in your input data. And if you attempt to round or truncate them to values that do appear, then you will likely end up with transition level values that are wholly unrepresentative of the inputs in terms of frequency--because that is the nature of the math.

        So the question becomes: What are you going to do with these values? What use are you intending to make of them? How critical is the outcome? Will lives depend upon it? Or just the position of a line on a presentation graph? Without some clues as to the purpose, it is difficult to make suggestions!

        Several possibilities come to mind that might be appropriate to some purposes.

        • You might decide (arbitrarily) that you want 3(N) levels.

          In which case, sort the inputs by frequency and pick the top 3(N).

        • Or you might decide that any input value that accounts for more than some (again arbitrary) percentage of the inputs constitutes a level.

          For the sample data, if you decided on 10%, then you get the 3 transitions at 44, 90 & 120. If you decided on 5%, then you gain an extra level at 100. (A short sketch of these first two policies appears below.)

        • Or you might favour a selection policy based around the mean and standard deviation.

          So any value that falls within 1 standard deviation of the mean constitutes a level.

        With a little more thought, I could come up with half a dozen other possibilities, but which, if any, is appropriate to your purpose depends very much upon that purpose.
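        As a sketch of the first two policies (assuming the %freq tally and $total from the snippet above):

          ## Policy 1: take the N most frequent values as the levels
          my $N = 3;
          my @by_count = sort { $freq{ $b } <=> $freq{ $a } } keys %freq;
          my @top_n = @by_count[ 0 .. $N - 1 ];

          ## Policy 2: take every value that accounts for more than $PCT percent of the inputs
          my $PCT = 10;
          my @levels = grep { 100 * $freq{ $_ } / $total > $PCT } keys %freq;

          print "top $N by count : @{[ sort { $a <=> $b } @top_n ]}\n";
          print "above $PCT%     : @{[ sort { $a <=> $b } @levels ]}\n";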


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Many thanks for the explanation about the segments of code I couldn't understand; I'll certainly use those features myself now :)

        Apologies for not making my intentions clear in the OP; the original intent was to just 'walk' the stream of data and determine when it 'became somehow stable' and that would define a certain Y-value that I would use later on...

        ...but if it would make things simpler/'better'(?), let me explain something in a newer posting (see below).