in reply to Sorting Data By Overlapping Intervals

I would propose two modifications. First, when you load the data from the file, extract the fourth column right away and store it alongside each line:

my @SNPs = map { [ (split /\t/)[3], $_ ] } <CG>;

So each element of @SNPs is now an array reference, whose first element is the fourth column and the second element is the full line.
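To make the structure concrete, here is a minimal sketch that builds @SNPs from two made-up lines instead of the CG file handle. The sample lines and the names @lines, $position and $line are assumptions for illustration only; the only thing taken from your data is that the value of interest sits in the fourth tab-separated column.

```perl
use strict;
use warnings;

# Hypothetical tab-separated sample lines (not real data);
# the position is assumed to be in the fourth column.
my @lines = (
    "chr1\trs1\tA\t150\tG\n",
    "chr1\trs2\tC\t900\tT\n",
);

# Each element becomes [ fourth column, full line ].
my @SNPs = map { [ (split /\t/)[3], $_ ] } @lines;

for my $snp (@SNPs) {
    my ($position, $line) = @$snp;   # unpack the two-element array reference
    print "position $position comes from: $line";
}
```

So $SNPs[0][0] is the position of the first line and $SNPs[0][1] is that line itself, unchanged.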

As the second change, in your loop over the intervals, pick all elements that fall in the current interval using grep and then extract the line from the array reference using map:

my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

All you need then is to print these lines into the relevant file.
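As a small worked example of the filter, assuming three made-up entries in @SNPs and an interval from 100 to 500 (both invented for illustration, not taken from your files), something like this should keep only the lines whose position lies inside the interval, boundaries included:

```perl
use strict;
use warnings;

# Hypothetical [ position, full line ] pairs, as built in the first step.
my @SNPs = (
    [ 150, "chr1\trs1\tA\t150\n" ],
    [ 900, "chr1\trs2\tC\t900\n" ],
    [ 420, "chr1\trs3\tG\t420\n" ],
);

# Hypothetical interval boundaries.
my ($start, $end) = (100, 500);

# grep keeps the pairs whose position is inside the interval,
# map then reduces each surviving pair to its full line.
my @inInterval = map  { $_->[1] }
                 grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

print @inInterval;   # the rs1 and rs3 lines, in their original order
```

Note that grep runs first (Perl evaluates the chain from right to left), so map only ever sees the entries that passed the test.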

I am not sure whether I have explained this well...

Re^2: Sorting Data By Overlapping Intervals
by ccelt09 (Sexton) on Oct 31, 2013 at 10:59 UTC

    The logic behind this makes sense, but once I have each element of @SNPs stored as an array reference as you explained above, I don't understand how to print those falling within the ranges in my second data set to a relevant file.

      This is what my second proposal does. If you have the interval boundaries in variables $start and $end, then

      my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

      will filter all relevant lines for this interval. You would just print OUT @inInterval; where OUT is the file handle for the file corresponding to this interval.

      Something like this:

      # read the SNP file once, keeping [ fourth column, full line ] per line
      open my $CG, "<", $cg_input or die "can't open $cg_input\n";
      my @SNPs = map { [ (split /\t/)[3], $_ ] } <$CG>;
      close($CG);

      open my $INTERVAL, "<", $input_interval or die "can't open $input_interval\n";
      my $interval = <$INTERVAL>; # skip first line
      foreach (<$INTERVAL>){
          chomp;
          my( $start, $end ) = split /\t/;
          # one output file per interval
          open my $OUT, ">", $output_directory."temp_file_".$count++.".txt"
              or die "can't open output file\n";
          print $OUT map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;
          close $OUT;
      }
      close($INTERVAL);