kikuchiyo has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have to write a small application that visualizes 2D datasets, finds local maxima and minima in them and lets the user manage those maxima and minima interactively. The interactive GUI part comes later, this question concerns the maxima/minima finding algorithm itself.

Certain assumptions can be made about the data: it is somewhat periodic, but not in the strict sense of the word: the amplitude and the frequency of the pulses do vary. Also, the data is quite noisy. Think of an ECG curve or brain wave pattern to get the picture.

What I need is the following: I have to find only one local maximum and one local minimum in every pulse, that is, the highest and lowest points of every pulse. Because of the noisiness of the data, this is not that trivial: simply scanning through the data and marking every local peak/valley does not suffice, I have to filter them somehow.

I thought that I might exploit the periodic property of the data: First I'd find all peaks, then sort them according to their heights, then weed out those that are too close to a legitimate peak.

Here is what I have so far:

#!/usr/bin/perl use strict; use warnings; my (@x, @y); while (<>) { unless (/[^\d\.\s-]/) { my @f = split; push @x, $f[0]; foreach my $i (1..$#f) { push @{ $y[$i-1] }, $f[$i]; } } } my ($maxima, $minima) = peak_find($y[0]); foreach my $index (@$maxima) { print "$x[$index]\t$y[0][$index]\n"; } print "\n"; print "\n"; foreach my $index (@$minima) { print "$x[$index]\t$y[0][$index]\n"; } print "\n"; # Function that first finds all local maxima and minima in the data # (the algorithm comes from http://www.perlmonks.org/index.pl?node_id= +448913) # then weeds them out so that only one maximum and minimum remains for + each pulse. sub peak_find { #my $xa = shift; my $ya = shift; my $period = autocorr_firstpeak($ya); my (@maxima, @minima); push @maxima, undef; push @minima, undef; foreach my $i (1..$#$ya) { if ( $ya->[$i] > $ya->[$i-1] ) { $maxima[ -1 ] = $i; } elsif ( $maxima[ -1 ] ) { $maxima[ @maxima ] = undef; } if ( $ya->[$i] < $ya->[$i-1] ) { $minima[ -1 ] = $i; } elsif ( $minima[ -1 ] ) { $minima[ @minima ] = undef; } } pop @maxima unless $maxima[ -1 ]; pop @minima unless $minima[ -1 ]; # At this point, @maxima and @minima contain # all local maxima and minima of the data, respectively my %max_hash = map {$_ => $ya->[$_]} @maxima; my %min_hash = map {$_ => $ya->[$_]} @minima; my @sorted_max = sort { $max_hash{$b} <=> $max_hash{$a} } keys %ma +x_hash; my @sorted_min = sort { $min_hash{$a} <=> $min_hash{$b} } keys %mi +n_hash; my %filtered_max_hash; my %filtered_min_hash; my $factor = 0.7; foreach my $index (@sorted_max) { if (exists $max_hash{$index}) { $filtered_max_hash{$index} = $max_hash{$index}; foreach my $shift (0..int($period*$factor)) { delete $max_hash{$index + $shift} unless exists $filtered_max_hash{$index + $shift}; delete $max_hash{$index - $shift} unless exists $filtered_max_hash{$index - $shift}; } } } foreach my $index (@sorted_min) { if (exists $min_hash{$index}) { $filtered_min_hash{$index} = $min_hash{$index}; foreach my $shift (0..int($period*$factor)) { delete $min_hash{$index + $shift} unless exists $filtered_min_hash{$index + $shift}; delete $min_hash{$index - $shift} unless exists $filtered_min_hash{$index - $shift}; } } } my @filtered_maxima = sort { $a <=> $b } keys %max_hash; my @filtered_minima = sort { $a <=> $b } keys %min_hash; return \@filtered_maxima, \@filtered_minima; } # Returns the first peak of the autocorrelation function - # this corresponds to the highest frequency of periodicity in the data +. sub autocorr_firstpeak { my $xa = shift; my @a; my ($min, $max, $index); my $which = 1; # 1=min, 0=max; $a[0] = 0; for my $x (@$xa) { $a[0] += $x**2; } for my $i (1..$#$xa) { $a[$i] = 0; for my $j (0..$#$xa) { $a[$i] += $xa->[$j]*$xa->[$j-$i]; } if ($which) { if ($a[$i] < $a[$i-1]) { $min = $a[$i]; } else { $which = 0; } } else { if ($a[$i] > $a[$i-1]) { $max = $a[$i]; } else { $index = $i-1; last; } } } return $index; }


This seems to work.

However, I feel that it is quite inefficient - is there a way (are there ways) to make it better?

You can generate test data that sort of looks like the real thing with the following command:
perl -le'print $_,"\t",sin($_/10)+sin(($_/30))**2*0.2+sin($_/500)*0.4+ +4+(rand()+rand()+rand()+rand())/20 for 0..6000' > testdata.txt

Replies are listed 'Best First'.
Re: Finding local maxima/minima in noisy, pediodic data
by BrowserUk (Patriarch) on Oct 12, 2009 at 13:47 UTC

    This seems to work pretty well on any dataset produced by your generator, along with a few variations on that, that I've tried:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; use GD::Graph::lines; use GD::Graph::colour qw(:colours :lists :files :convert); my $graph = GD::Graph::lines->new( 6000, 768 ); die 'No data file specified' unless @ARGV and -e $ARGV[ 0 ]; open my $fh, '<', $ARGV[ 0 ] or die "Couldn't open $ARGV[ 0 ]"; my @data = ([],[]); push( @{ $data[ 0 ] }, $1 ), push( @{ $data[ 1 ] }, $2 ) while <$fh> =~ m[(\S+)\s+(\S+)]; close $fh; #pp \@data; <>; $graph->set( title => 'Y over X', 'bgclr' => 'white', 'transparent' => 0, x_label => 'X', x_max_value => 6000, x_tick_number => 6, y_label => 'Y', y_tick_number => 8, ) or die $graph->error; my @hilos; my $crossover = $data[1][0]; my $dir = $data[1][1] > $crossover; my( $max, $min ) = ( -1e308, 1e308 ); my $count = 0; for my $y ( @{ $data[ 1 ] } ) { ++$count; $max = $y if $y > $max; $min = $y if $y < $min; if( $dir and $y < $crossover ) { push @hilos, ( $max ) x $count; $count = 0; $dir = 0; $max = $crossover; } elsif( !$dir and $y > $crossover ) { push @hilos, ( $min ) x $count; $count = 0; $dir = 1; $min = $crossover; } } push @data, \@hilos; my $gd = $graph->plot( \@data ) or die $graph->error; open(IMG, '>800691.png') or die $!; binmode IMG; print IMG $gd->png; close IMG; system '800691.png';

    It should be reasonably efficient as there's no sorting involved, just a single pass through the data. Just supply the script with the name of the data file as its only argument. The maximas and minimas (do they have a collective name? extremas?), will be plotted as a square wave in green overlaying the data in red.

    It's easy to see datasets that it would screw up on, but it just depends how good a model your generator is?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks.

      I couldn't run your program because I don't have the GD::Graph module (I couldn't install it and don't have time right now to fix it), but I took the core loop of your program and adapted it. Well, it is not satisfactory.

      It incorrectly labels some extrema as true ones but also misses some of them.

      It turns out that my test data generator was too simple. Here is a real example.

        In the interim, I plotted the second and third columns against the index on separate graphs using my first algorithm and posted the results here & here respectively.

        Perhaps you could load one or both of these images into a simple graphics editor and annotate the misses and false hits for us. I can see for example that the last 5 periods on the first graph above have false hits,

        Update: but I didn't notice any misses? Now I have. The 12th period on the first graph is a miss.

        But maybe I'm plotting the wrong values? Maybe you are working with the midpoints of the 2nd & 3rd columns?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Could you explain how the 3 columns in this sample relate to the two column data in your ealier post?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding local maxima/minima in noisy, pediodic data
by moritz (Cardinal) on Oct 12, 2009 at 12:49 UTC

    There is a variety of things you can try:

    • Use a running average to smoothen your data. Find the peaks in the smoothed data, then go back to your original data and find just one peak close to a peak in the smoothed data
    • Make a mathematical model of your data, and do a least squares fit, or maybe a fit on a piece of the data
    • Another option to smoothen your data is to do a fourier transform, cut off high frequencies, and then transform back (though that's quite costly in terms of computation
    • You could try to re-use an existing audio codec that's based on a wavelet decomposition to do the hard work for you, it solves similar problems. You just have to find a clever way to interface such a codec.
    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Finding local maxima/minima in noisy, pediodic data
by NiJo (Friar) on Oct 12, 2009 at 19:24 UTC
    With a signal to noise ratio of 3 to 6 I'd have difficulties in drawing any conclusions from the data. Analytical chemistry uses S/N=3 for the detection limit and S/N=10 as a minimum for quantification. With a S/N of 3 your maxima/minima have an error of 33%. I'd focus on reducing (electrical?) noise before any data is acquired (Optimal A/D range?), but that is most likely not your job.

    The next best thing would be oversampling. 16 times oversampling (averaging 16 measurements) should give a 4-fold increase in S/N to a workable 12 to 24.

    With the current data set you have to face the uncertanty principle in x (correct frequencies, impeded by averaging) and y (correct amplitudes, improving via averaging). Classical approaches like derivatives produce too much noise. The only recognisable feature is the steep rise for each pulse. I'd declare a new pulse if the signal had a monotone increase for a certain number of samples and the combined increase has a S/N of at least 3.

Re: Finding local maxima/minima in noisy, pediodic data
by mpeg4codec (Pilgrim) on Oct 12, 2009 at 20:56 UTC
    Push the data through a low-pass filter:
    use constant { alpha => 0.1, }; my ($prev_x, $prev_y) = (0, 0); while (<>) { my ($t, $x, $y) = split; $x = $prev_x + alpha * ($x - $prev_x); $prev_x = $x; $y = $prev_y + alpha * ($y - $prev_y); $prev_y = $y; print "$t $x $y\n"; }
    Tweak the alpha to get at the desired frequency band.
      If I plot your test data up, there is a wave with a period of something like 3000, and there is a wave with a period of about 60. Is it the long period you are looking for extrema in?
      If I assume that it is the 3000 period wave you are looking for extrema in, I can see you keeping track of 2 quantities. One is the long term average, something which takes in more than 2 of the long term waves (hopefully much more than 2). The other is a moving average, of about 1/3 the period (so about 1000 points). Occasionally a person has to calculate the average of that 1000 points, but for many iterations it is probably easier to just use a circular buffer of all the points, and to calculate the new average you subtract out the value to be replaced and add in the new value. If the 1000 term MA is below the long term average, you are looking for minima. If the 1000 term MA is above the long term average, you are looking for a maxima.
        No, I'm interested in the extrema of the short wavelength pulses.

        However, I can use your suggestion about long term averages for that, too.

        A possible strategy: In the first pass I generate a smoothed dataset from the original using a moving average of about 200 points (or 2-3 periods of the short pulses). This will smoothen out the short wavelength pulses but retain the long term characteristics. Then I compare the original with this smoothed dataset point by point: if it is below the average I need to look for a minimum, if it is above, I look for a maximum.
Re: Finding local maxima/minima in noisy, pediodic data
by ambrus (Abbot) on Oct 13, 2009 at 09:55 UTC

    If the data is suitable, you can try to split the signal to half periods roughly by alternatly searching for the first point when the signal reaches an upper threshold and a lower threshold. After this, you just find the minimum or maximum in each half period.

    Let me show an example because my description is not clear. Assume the signal is like this:

    our @s = map { sin(0.18*$_) + rand(0.6) - 0.3 } 0 .. 99;
    Then you split the time to approximate half periods like this:
    our(@xlo, @xhi); my $t = my $t0 = $s[0] < 0; for my $x (0 .. @s - 1) { + if ($t) { if ($s[$x] > 0.5) { $t = 0; push @xhi, $x; } } else { if ( +$s[$x] < -0.5) { $t = 1; push @xlo, $x; } } }

    Now values in @xlo give the start of each lower half period, and similarly @xhi contain the starting time of each higher half period. So now you search for a minimum in each lower half period and the maximum in each upper half.

    our(@xmi, @xma); for my $p (0 .. @xlo - 1) { my $xmi = $xlo[$p]; for m +y $x ($xlo[$p] .. ($xhi[$p + $t0] || @s) - 1) { $s[$x] < $s[$xmi] and + $xmi = $x; } push @xmi, $xmi; } for my $p (0 .. @xhi - 1) { my $xma += $xhi[$p]; for my $x ($xhi[$p] .. ($xlo[$p + 1 - $t0] || @s) - 1) { +$s[$xma] < $s[$x] and $xma = $x; } push @xma, $xma; }

    (The code is complicated only because of the boundary conditions, furthermore it is left as an exercise to the reader to fix any off by one errors.)

    Let's plot the results

    for my $x (0 .. @s - 1) { my $l = " "x15 . ":" . " "x15; if (@xlo && $ +xlo[0] <= $x) { shift @xlo; substr $l, 0, 10, "v"x10; } elsif (@xhi & +& $xhi[0] <= $x) { shift @xhi; substr $l, 20, 10, "v"x10; } if (@xmi +and $xmi[0] <= $x) { shift @xmi; substr $l, 0, 7, "minimum"; } elsif +(@xma and $xma[0] <= $x) { shift @xma; substr $l, 23, 7, "maximum"; } + substr $l, 15+int(10*$s[$x]), 1, "*"; print $l, "\n"; }

    We get, for example, this.

    If the data is more noisy, you may need to smooth it before comparing with the thresholds. If the amplitude varies then it's not so easy to set a fix threshold this way, so I'm not sure what to do in that case.

      This may sound heretical, but I suggest that Perl may be the wrong tool for the specific job you want doing. Great for collecting the data to feed to a better tool and great for interpreting and/or displaying that tool's result, but not especially good for doing the signal processing. You can write signal processing code in Perl, or any other Turing complete language, but sometimes it's better not to. I learned this lesson through experience ...

      As others have said, you're asking a moderately hard question in DSP and, IMO, you should be using tools which have been designed specifically to perform efficient and robust DSP. There are a good number available, some which cost real money, some which are free (as in beer and/or speech) and some which are in between.

      You don't reveal the nature of the platform you want to run this on, your budget (time and money, remember) or your views on free / non-free software so it's difficult to give specific advice and targeted use of Google is highly recommended. That said, consider Matlab/Octave for turn-key solutions. I've learned a lot from Numerical Recipes and have used their code in some of my projects.

      Good luck!

      Paul

        Hey, I did not recommend perl for the tool, I just gave this toy example in perl to explain what method to use.

Re: Finding local maxima/minima in noisy, pediodic data
by kikuchiyo (Hermit) on Oct 14, 2009 at 17:17 UTC
    Thank you all for your comments.

    I ended up doing two smoothing passes before I attempt to find any extrema: one is smoothed twice by a running average with a window of 1.1*$period (which I find with a simplified version of the autocorrelation func.), while the second is smoothed with a running median algorithm. The first (almost) completely removes the pulses, leaving only the "carrier wave" in place, while the second preserves the shape of the pulses but removes the spiky noise from them.

    This gives me a robust crossover detection: if the running median is higher than the large scale smoothed wave, then I have to look for a maximum, else a minimum.

    This algorithm is somewhat complex, however, it is (or seems to be) robust enough.

    Ultimately, this algorithm will be the core of a Gtk2 GUI application that lets the user (my boss) visualize the data, find the peaks and mark/move/delete/add/export them at will. However, I'll write this later.

    As for Perl not being the best tool for this particular problem, or a particular class of DSP-related problems -- well, I hesitate to say it out loud on this hallowed ground, but yes, quite likely that is true. There maybe other languages, platforms or tools which may be more efficient or easier to use for this problem than Perl; however, Perl is something I'm comfortable with. I happen to like Perl. :)

    There is also the factor that I need my application to be portable, specifically, I develop it on linux but it'll have to run on windows. It is my experience that this works with Perl; I wouldn't be sure about other languages (for example, I never wrote anything for Windows in C).

    Additionally, Perl is free, which is important to me for both practical and ethical standpoints.

      You have your solution, but are you aware that there is an anomoly in your data (n=47.00) that doesn't stand up to statistical analysis? It's more than just a random discontinuity, it's almost like a chunk of data was deleted from the set.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.