extract values from a field that are consecutive and within one standard deviation of each other

mulder4786 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: extract values from a field that are consecutive and within one standard deviation of each other by tangent (Parson) on Jul 14, 2015 at 02:40 UTC
Here is one way to do it. I have changed the block size and deviation, and added my own data, in order to get some output: use strict; use warnings; my @data; while (<DATA>) { my @fields = split; push(@data, \@fields ); } my $block_size = 4; my $deviation = 0.5; my $last = $data[0]; my @candidates = ( $last ); for my $i ( 1 .. $#data ) { my $current = $data[$i]; if ( within_range( $last->[3], $current->[3] ) ) { push( @candidates, $current ); } elsif ( @candidates >= $block_size ) { print_block( \@candidates ); @candidates = ( $current ); } else { @candidates = ( $current ); } $last = $current; } # deal with stragglers if ( @candidates >= $block_size ) { print_block( \@candidates ); } sub within_range { my ( $x, $y ) = @_; return ( abs( $x - $y ) > $deviation ? 0 : 1 ); } sub print_block { my ( $lines ) = @_; print "BLOCK\n"; for my $line ( @$lines ) { print join(' ', @$line ), "\n"; } } __DATA__ 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 1 14933 rs199856693 0.0357142857142857 0.0555555555555556 1 14948 rs201855936 0.107142857142857 0 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375 1 10492 rs55998931 2.6 0.4375 [download] Output: `BLOCK 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 BLOCK 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375` [download] Update: see Myrddin Wyllt's explanation below as to why this won't give the desired results.	[reply] [d/l] [select]
Re^2: extract values from a field that are consecutive and within one standard deviation of each other by Myrddin Wyllt (Hermit) on Jul 15, 2015 at 16:56 UTC
This will produce blocks where consecutive values are within range, but not necessarily where all values are within range - for example, the second block in the output ranges from 1 to 2 for a required maximum deviation of 0.5. To produce the desired result, you need to keep track of the maximum and minimum values within the candidate set, and test against these. Also, if you drop the whole candidate set as soon as the latest value doesn't fit, you will miss sequences which start in the middle of the current set but don't include the early values - you need to add the latest value to the end and then chop values off the beginning until you get back to a qualifying set. This may end up being just the latest value, but it could be larger. The following code shows one way to do this. It uses the same changed block size and deviation as tangent's example, and adds some more data to illustrate overlapping qualifying sets use strict; use warnings; # Set block size and deviation my $block_size = 4; my $deviation = 0.5; # Initialise the candidates, maximum and minimum values my $candidates = [[split(' ', <DATA>)]]; my $maxval = my $minval = $candidates->[0][3]; # Loop through the rest of the DATA while (<DATA>) { my $current = [split]; if (within_range($current->[3], $maxval, $minval)) { push(@$candidates, $current); $maxval = $current->[3] if $maxval < $current->[3]; $minval = $current->[3] if $minval > $current->[3]; } elsif (@$candidates >= $block_size) { print_block($candidates); push(@$candidates, $current); ($candidates, $maxval, $minval) = trim_candidates($candidates); } else { push(@$candidates, $current); ($candidates, $maxval, $minval) = trim_candidates($candidates); } } # deal with stragglers if ( @$candidates >= $block_size ) { print_block( $candidates ); } sub within_range { my ($testval, $testmax, $testmin) = @_; return 0 if $testmax - $testval > $deviation; return 0 if $testval - $testmin > $deviation; return 1; } sub print_block { my ( $lines ) = @_; print "BLOCK\n"; for my $line ( @$lines ) { print join(' ', @$line ), "\n"; } } sub trim_candidates { my $worklist = shift; # drop the first entry shift @$worklist; # Check if the remaining worklist qualifies my $workmax = my $workmin = $worklist->[0][3]; foreach my $item (@$worklist) { return trim_candidates($worklist) unless within_range($item->[3], $workmax, $workmin); $workmax = $item->[3] if $workmax < $item->[3]; $workmin = $item->[3] if $workmin > $item->[3]; } return ($worklist, $workmax, $workmin); } 0; __DATA__ 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 1 14933 rs199856693 0.0357142857142857 0.0555555555555556 1 14948 rs201855936 0.107142857142857 0 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375 1 10492 rs55998931 2.6 0.4375 1 13418 blah 20.0 blah 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah 1 13418 blah 20.7 blah 1 13418 blah 30.5 blah 1 13418 blah 30.0 blah 1 13418 blah 30.0 blah 1 13418 blah 30.5 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah [download] Output: BLOCK 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 BLOCK 1 13418 blah 20.0 blah 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah BLOCK 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah BLOCK 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah 1 13418 blah 20.7 blah BLOCK 1 13418 blah 30.5 blah 1 13418 blah 30.0 blah 1 13418 blah 30.0 blah 1 13418 blah 30.5 blah BLOCK 1 13418 blah 30.5 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah [download]	[reply] [d/l] [select]
Re^3: extract values from a field that are consecutive and within one standard deviation of each other by tangent (Parson) on Jul 16, 2015 at 14:31 UTC
Nicely done!	[reply]
Re^2: extract values from a field that are consecutive and within one standard deviation of each other by mulder4786 (Novice) on Jul 14, 2015 at 03:51 UTC
This is great, thank you.	[reply]
Re: extract values from a field that are consecutive and within one standard deviation of each other by Anonymous Monk on Jul 14, 2015 at 01:28 UTC
Sure, here you go, perlintro#Arrays, split, Path::Tiny#openr_raw, readline	[reply]