Re^4: Sliding window perl program

Hi, here is an explanation of the task I want to perform.

1. Match the value of column 2 of file 1 with the value of column 1 of file 2 (i.e. chr11), and then fetch all lines of file 2 that have the chr11 value.

2. Compute -500 and +500 around the 4th value of file 1 (i.e. 650). From 150 to 1150, slide a window of size 100 in steps of 50, i.e. 150-250, 200-300, 250-350, 300-400 ... 1050-1150.

3. Count the lines of file 2 whose value in the second column lies within each of these windows.

4. For this example, the answer should be:

150 250 0

250 350 0

350 450 0

450 550 0

550 650 5

650 750 1

750 850 0

850 950 0

950 1050 0

1050 1150 0

1150 1250 0

Re^5: Sliding window perl program by Laurent_R (Canon) on Oct 22, 2015 at 18:56 UTC
Alright, now I understand what you want. I think I can come up with a simple solution a bit later (no time right now), but there still seems to be an inconsistency in your description of the requirement. In paragraph 2, your ranges are 150-250, 200-300, 250-350..., i.e. with an increment of 50 and an overlap between successive ranges. In the expected outcome you show, your ranges are 150-250, 250-350, 350-450..., i.e. an increment of 100 and no overlap. Can you please clarify?
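To make the two readings concrete, here is a minimal sketch that enumerates the windows under each one. The `windows` helper is hypothetical (it is not code from this thread), and the sketch assumes the range 150 to 1150 derived from the value 650 in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): list [start, end] windows
# of width $size that fit between $low and $high, advancing by $step.
sub windows {
    my ($low, $high, $size, $step) = @_;
    my @w;
    for (my $start = $low; $start + $size <= $high; $start += $step) {
        push @w, [$start, $start + $size];
    }
    return @w;
}

my $center = 650;                                    # value from file 1 in the example
my ($low, $high) = ($center - 500, $center + 500);   # 150 and 1150

my @overlapping = windows($low, $high, 100, 50);     # reading of paragraph 2
my @disjoint    = windows($low, $high, 100, 100);    # reading of the expected output

printf "step 50:  %d windows, first %d-%d, last %d-%d\n",
    scalar @overlapping, @{ $overlapping[0] }, @{ $overlapping[-1] };
printf "step 100: %d windows, first %d-%d, last %d-%d\n",
    scalar @disjoint, @{ $disjoint[0] }, @{ $disjoint[-1] };
```

Both readings end at 1050-1150 here, whereas the expected output in the question also lists 1150-1250, which lies outside the 150 to 1150 range of paragraph 2.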
Re^5: Sliding window perl program by Laurent_R (Canon) on Oct 22, 2015 at 21:20 UTC
Consider this pseudo-solution, to be adapted to your exact requirement, which is not fully available as of this writing. Even though it is now a bit more complex, I stick to the idea of reading each of the two files only once, because it is usually much more efficient. So, after having read file 1 once and closed it, we need to read file 2 line by line and store the information collected in a nested data structure (probably a hash of arrays or a hash of hashes). Once reading file 2 is completed, output the content of the data structure. I am only displaying below the second while loop of my previous code, as there is no need to change the first loop on file 1.

    my $margin = 500;
    open my $SC, "<", $file2 or die "Error: could not open $file2 $!";
    my %result;
    my $step = 100;
    while (my $line2 = <$SC>) {
        my ($id, $val) = split /\t/, $line2;
        my $val_file1 = $hash{$id};
        next unless defined $val_file1;            # id not present in file 1, skip it
        my ($low, $high) = ($val_file1 - $margin, $val_file1 + $margin);
        next unless $val > $low and $val < $high;  # value not within range, just discard it
        my $delta = int(($val - $low) / $step);    # delta: slot number
        $result{$id}{$delta}++;
    }
    close $SC;
    # now %result has, for each $id, a frequency distribution by steps of 100
    # (slots 0 to 9); we just need to extract the data from it.
    for my $id (keys %result) {
        my $low = $hash{$id} - $margin;
        for my $slot (0..9) {
            my $range = sprintf "%d-%d", $low + $slot * $step, $low + ($slot + 1) * $step;
            my $frequency = $result{$id}{$slot} // 0;
            print "ID $id: $range : $frequency \n";
        }
    }

I think it should work more or less the way you want, but I cannot currently test that code on my tablet, so there may be a typo or an error here or there, or possibly an off-by-one mistake somewhere, but I think the basic idea should be there and it should be easy to get it straight with just a bit of testing.
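As a quick check of the slot arithmetic, the following sketch applies the same `int(($val - $low) / $step)` formula to a few values. The values are illustrative only (they are not taken from the poster's file 2), and 650 is the value from the question.

```perl
use strict;
use warnings;

my ($margin, $step) = (500, 100);
my $low = 650 - $margin;            # 150, using the 650 from the question

# Illustrative values only, not taken from the poster's file 2
my %slot_of;
for my $val (151, 249, 250, 600, 700, 1149) {
    my $slot = int(($val - $low) / $step);   # same formula as in the loop above
    $slot_of{$val} = $slot;
    printf "%4d -> slot %d (%d-%d)\n",
        $val, $slot, $low + $slot * $step, $low + ($slot + 1) * $step;
}
```

Note that a value sitting exactly on a boundary, such as 250, lands in the upper slot (250-350), so each printed range effectively includes its start and excludes its end. This is one place to look if an off-by-one difference shows up against the expected counts.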
If your "sliding window" is different from what I have done, it should just take minor changes to the values of the parameters ($margin, $step) and perhaps a bit more work in the final printing of the results at the end, provided the %result hash has sufficiently detailed information.
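For instance, if the windows are meant to overlap (size 100, step 50, as in paragraph 2 of the question), one value can fall into two windows, so a single slot number per value is no longer enough. Below is a minimal sketch of that variant under stated assumptions: `%hash` and `@rows` are made-up stand-ins for the contents of file 1 and file 2, `$size` is a new parameter not present in the code above, and windows are assumed to include their start and exclude their end.

```perl
use strict;
use warnings;

my ($margin, $size, $step) = (500, 100, 50);   # step 50: assumed overlapping variant
my %hash = (chr11 => 650);                     # stand-in for the data read from file 1
# Made-up (id, value) pairs standing in for the lines of file 2
my @rows = ([chr11 => 560], [chr11 => 610], [chr11 => 640],
            [chr11 => 700], [chr12 => 600]);

my %result;
for my $row (@rows) {
    my ($id, $val) = @$row;
    my $center = $hash{$id};
    next unless defined $center;               # id absent from file 1
    my $low  = $center - $margin;
    my $last = int((2 * $margin - $size) / $step);   # index of the last window
    # Window $k covers $low + $k*$step up to (but excluding) $low + $k*$step + $size
    for my $k (0 .. $last) {
        my $start = $low + $k * $step;
        $result{$id}{$k}++ if $val >= $start and $val < $start + $size;
    }
}

# Print only the windows that received at least one value
for my $id (sort keys %result) {
    my $low = $hash{$id} - $margin;
    for my $k (sort { $a <=> $b } keys %{ $result{$id} }) {
        my $start = $low + $k * $step;
        printf "ID %s: %d-%d : %d\n", $id, $start, $start + $size, $result{$id}{$k};
    }
}
```

Testing every window for every value is the simplest approach and is fine for 19 windows; with many more windows, the two matching window indexes could be computed directly from the value instead.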