mulder4786 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, new here. I have a file that looks like:

1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 1 14933 rs199856693 0.0357142857142857 0.0555555555555556 1 14948 rs201855936 0.107142857142857 0

I would like to extract rows where the values in field 4 are within one standard deviation of each other (or some other fixed threshold such as within 0.07 of each other) for at least 10 consecutive rows. Therefore the output would have blocks of >=10 consecutive rows where the values in field 4 are close in range within each block. Can anyone point me in the right direction?

  • Comment on extract values from a field that are consecutive and within one standard deviation of each other
  • Download Code

Replies are listed 'Best First'.
Re: extract values from a field that are consecutive and within one standard deviation of each other
by tangent (Parson) on Jul 14, 2015 at 02:40 UTC
    Here is one way to do it. I have changed the block size and deviation, and added my own data, in order to get some output:
    use strict; use warnings; my @data; while (<DATA>) { my @fields = split; push(@data, \@fields ); } my $block_size = 4; my $deviation = 0.5; my $last = $data[0]; my @candidates = ( $last ); for my $i ( 1 .. $#data ) { my $current = $data[$i]; if ( within_range( $last->[3], $current->[3] ) ) { push( @candidates, $current ); } elsif ( @candidates >= $block_size ) { print_block( \@candidates ); @candidates = ( $current ); } else { @candidates = ( $current ); } $last = $current; } # deal with stragglers if ( @candidates >= $block_size ) { print_block( \@candidates ); } sub within_range { my ( $x, $y ) = @_; return ( abs( $x - $y ) > $deviation ? 0 : 1 ); } sub print_block { my ( $lines ) = @_; print "BLOCK\n"; for my $line ( @$lines ) { print join(' ', @$line ), "\n"; } } __DATA__ 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 1 14933 rs199856693 0.0357142857142857 0.0555555555555556 1 14948 rs201855936 0.107142857142857 0 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375 1 10492 rs55998931 2.6 0.4375
    Output:
    BLOCK 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 BLOCK 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375
    Update: see Myrddin Wyllt's explanation below as to why this won't give the desired results.

      This will produce blocks where consecutive values are within range, but not necessarily where all values are within range - for example, the second block in the output ranges from 1 to 2 for a required maximum deviation of 0.5.

      To produce the desired result, you need to keep track of the maximum and minimum values within the candidate set, and test against these.

      Also, if you drop the whole candidate set as soon as the latest value doesn't fit, you will miss sequences which start in the middle of the current set but don't include the early values - you need to add the latest value to the end and then chop values off the beginning until you get back to a qualifying set. This may end up being just the latest value, but it could be larger.

      The following code shows one way to do this. It uses the same changed block size and deviation as tangent's example, and adds some more data to illustrate overlapping qualifying sets

      use strict; use warnings; # Set block size and deviation my $block_size = 4; my $deviation = 0.5; # Initialise the candidates, maximum and minimum values my $candidates = [[split(' ', <DATA>)]]; my $maxval = my $minval = $candidates->[0][3]; # Loop through the rest of the DATA while (<DATA>) { my $current = [split]; if (within_range($current->[3], $maxval, $minval)) { push(@$candidates, $current); $maxval = $current->[3] if $maxval < $current->[3]; $minval = $current->[3] if $minval > $current->[3]; } elsif (@$candidates >= $block_size) { print_block($candidates); push(@$candidates, $current); ($candidates, $maxval, $minval) = trim_candidates($candidates); } else { push(@$candidates, $current); ($candidates, $maxval, $minval) = trim_candidates($candidates); } } # deal with stragglers if ( @$candidates >= $block_size ) { print_block( $candidates ); } sub within_range { my ($testval, $testmax, $testmin) = @_; return 0 if $testmax - $testval > $deviation; return 0 if $testval - $testmin > $deviation; return 1; } sub print_block { my ( $lines ) = @_; print "BLOCK\n"; for my $line ( @$lines ) { print join(' ', @$line ), "\n"; } } sub trim_candidates { my $worklist = shift; # drop the first entry shift @$worklist; # Check if the remaining worklist qualifies my $workmax = my $workmin = $worklist->[0][3]; foreach my $item (@$worklist) { return trim_candidates($worklist) unless within_range($item->[3], $workmax, $workmin); $workmax = $item->[3] if $workmax < $item->[3]; $workmin = $item->[3] if $workmin > $item->[3]; } return ($worklist, $workmax, $workmin); } 0; __DATA__ 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 1 14930 rs75454623 0.535714285714286 0.611111111111111 1 14933 rs199856693 0.0357142857142857 0.0555555555555556 1 14948 rs201855936 0.107142857142857 0 1 10492 rs55998931 1 0.4375 1 10492 rs55998931 1.5 0.4375 1 10492 rs55998931 1.9 0.4375 1 10492 rs55998931 2 0.4375 1 10492 rs55998931 2.6 0.4375 1 13418 blah 20.0 blah 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah 1 13418 blah 20.7 blah 1 13418 blah 30.5 blah 1 13418 blah 30.0 blah 1 13418 blah 30.0 blah 1 13418 blah 30.5 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah

      Output:

      BLOCK 1 10492 rs55998931 0.272727272727273 0.4375 1 13418 . 0.25 0.0625 1 13752 . 0.153846153846154 0.25 1 13813 . 0.0357142857142857 0.2 1 13838 . 0.0357142857142857 0.2 1 14907 rs79585140 0.5 0.555555555555556 BLOCK 1 13418 blah 20.0 blah 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah BLOCK 1 13418 blah 20.1 blah 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah BLOCK 1 13418 blah 20.2 blah 1 13418 blah 20.3 blah 1 13418 blah 20.4 blah 1 13418 blah 20.5 blah 1 13418 blah 20.6 blah 1 13418 blah 20.7 blah BLOCK 1 13418 blah 30.5 blah 1 13418 blah 30.0 blah 1 13418 blah 30.0 blah 1 13418 blah 30.5 blah BLOCK 1 13418 blah 30.5 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah 1 13418 blah 30.6 blah 1 13418 blah 30.9 blah
        Nicely done!

      This is great, thank you.

Re: extract values from a field that are consecutive and within one standard deviation of each other
by Anonymous Monk on Jul 14, 2015 at 01:28 UTC