
Hi, this program implements a sliding window. I want to count the number of entries in file 2 that fall within each window (window size 100, sliding by 50) over the range from -500 to +500 around 650. That is why I wrote a for loop with an increment of 100 and then tried to increase $up by 50, i.e. 150-250

200-300

250-350

...

...

...

1050-1150

I hope you understand the question.

Re^3: Sliding window perl program
by Laurent_R (Canon) on Oct 21, 2015 at 23:18 UTC
    Well, since you insist on sliding windows, it appears that I missed your requirement, sorry about that. But I still really don't understand what you want.

    Perhaps a more detailed example of your input and desired output would help.

      Hi, here is an explanation of the task I want to perform. 1. Match the value in column 2 of file 1 with the value in column 1 of file 2 (i.e. chr11), then fetch all lines of file 2 that have the value chr11.

      2. Take -500 and +500 around the fourth value of file 1 (i.e. 650). Over the resulting range, 150 to 1150, slide a window of size 100 in steps of 50, i.e. 150-250, 200-300, 250-350, 300-400, ..., 1050-1150.

      3. Count the lines of file 2 whose second-column value lies in each of these windows.

      4. For this example, the answer should be:

      150 250 0

      250 350 0

      350 450 0

      450 550 0

      550 650 5

      650 750 1

      750 850 0

      850 950 0

      950 1050 0

      1050 1150 0

      1150 1250 0

        Alright, now I understand what you want. I think I can come up with a simple solution a bit later (no time right now), but there seems to be still an inconsistency in your description of the requirement.

        In paragraph 2, your ranges are 150-250, 200-300, 250-350..., i.e. with an increment of 50 and an overlap between successive ranges.

        In the expected outcome you show, your ranges are 150-250, 250-350, 350-450..., i.e. an increment of 100 and no overlap.

        Can you please clarify?
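
To make the difference between the two readings concrete, here is a minimal sketch that generates both sets of windows. The sub name `make_windows` and its parameters are made up for illustration; the numbers (center 650, margin 500, size 100, steps 50 and 100) come from the description above.

```perl
use strict;
use warnings;

# Build [start, end] pairs covering $center - $margin .. $center + $margin.
# $size is the window width, $step the distance between successive starts.
# (Hypothetical helper, not part of the original program.)
sub make_windows {
    my ($center, $margin, $size, $step) = @_;
    my ($low, $high) = ($center - $margin, $center + $margin);
    my @windows;
    for (my $start = $low; $start + $size <= $high; $start += $step) {
        push @windows, [$start, $start + $size];
    }
    return @windows;
}

# Overlapping windows, as in paragraph 2 (size 100, step 50)
my @overlapping = make_windows(650, 500, 100, 50);
# Non-overlapping windows, as in the expected output (size 100, step 100)
my @disjoint = make_windows(650, 500, 100, 100);

printf "%d overlapping windows, first %d-%d, last %d-%d\n",
    scalar @overlapping, @{ $overlapping[0] }, @{ $overlapping[-1] };
printf "%d disjoint windows, first %d-%d, last %d-%d\n",
    scalar @disjoint, @{ $disjoint[0] }, @{ $disjoint[-1] };
```

With a step of 50 this yields 19 windows, with a step of 100 it yields 10, and both end at 1050-1150. The expected output above has one more row (1150-1250), which this sketch does not produce.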

        Consider this pseudo-solution, to be adapted to your exact requirement, which is not fully clear as of this writing.

        Even though it is now a bit more complex, I stick to the idea of reading each of the two files only once, because it is usually much more efficient. So, after having read file 1 once and closed it, we need to read file 2 line by line and store the information collected in a nested data structure (probably a hash of arrays or a hash of hashes). Once reading file 2 is completed, output the content of the data structure.

        I am only displaying below the second while loop of my previous code, as there is no need to change the first loop on file 1.

        my $margin = 500;
        my $step   = 100;
        my %result;
        open my $SC, "<", $file2 or die "Error: could not open $file2 $!";
        while (my $line2 = <$SC>) {
            chomp $line2;
            my ($id, $val) = split /\t/, $line2;
            my $val_file1 = $hash{$id};
            next unless defined $val_file1;    # ID not seen in file 1
            my ($low, $high) = ($val_file1 - $margin, $val_file1 + $margin);
            next unless $val > $low and $val < $high;    # value not within range, just discard it
            my $delta = int( ($val - $low) / $step );    # delta: slot number
            $result{$id}{$delta}++;
        }
        close $SC;

        # Now %result has, for each $id, a frequency distribution by steps
        # of 100 (slots 0 to 9); we just need to extract the data from it.
        for my $id (keys %result) {
            my $low = $hash{$id} - $margin;
            for my $slot (0..9) {
                my $range = sprintf "%d-%d", $low + $slot * $step, $low + ($slot + 1) * $step;
                my $frequency = $result{$id}{$slot} // 0;
                print "ID $id: $range : $frequency \n";
            }
        }

        I *think* it should work more or less the way you want, but I cannot currently test that code on my tablet, so there may be a typo or an error here or there, or possibly an off-by-one mistake somewhere, but I think the basic idea should be there and it should be easy to get it straight with just a bit of testing.
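
As a quick way to check for off-by-one mistakes, the slot formula from the loop above can be tried on a few boundary values. This is only a sketch: the sub name `slot_for` is made up, and the test values assume a file 1 value of 650 (so $low is 150) and a step of 100.

```perl
use strict;
use warnings;

# Slot number for $val, using the same formula as in the loop above.
# (Hypothetical helper, for checking boundaries only.)
sub slot_for {
    my ($val, $low, $step) = @_;
    return int( ($val - $low) / $step );
}

my ($low, $step) = (150, 100);    # 650 - 500, and the step used above
for my $val (151, 249, 250, 649, 650, 1149) {
    my $slot = slot_for($val, $low, $step);
    printf "%4d -> slot %d (%d-%d)\n",
        $val, $slot, $low + $slot * $step, $low + ($slot + 1) * $step;
}
```

A value sitting exactly on a boundary (250, 650) goes into the window that starts there, not the one that ends there. If you want the opposite, the formula needs adjusting.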

        If your "sliding windows" is different from what I have done, it should be just minor changes in the value of the params ($margin, $step) and perhaps a bit more work in the final printing of the results at the end, provided the %result hash has sufficiently detailed information.
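
If the windows do overlap (size 100, step 50), a single value can belong to two windows, so a single `int(...)` slot number is no longer enough. Here is a minimal sketch of one way to handle that, assuming half-open windows [start, start + size). The sub name `window_slots` and the sample values are made up for illustration and are not taken from your files.

```perl
use strict;
use warnings;

# Return the indices of all windows [start, start + $size) containing $val,
# where window k starts at $low + k * $step, for k in 0 .. $n_windows - 1.
# (Hypothetical helper; the half-open boundaries are an assumption.)
sub window_slots {
    my ($val, $low, $size, $step, $n_windows) = @_;
    my @slots;
    for my $k (0 .. $n_windows - 1) {
        my $start = $low + $k * $step;
        push @slots, $k if $val >= $start and $val < $start + $size;
    }
    return @slots;
}

my ($low, $size, $step, $n_windows) = (150, 100, 50, 19);
my @vals = (560, 600, 649, 700);    # made-up values, for illustration only
my %count;
for my $val (@vals) {
    $count{$_}++ for window_slots($val, $low, $size, $step, $n_windows);
}

# Print only the windows that received at least one value.
for my $k (sort { $a <=> $b } keys %count) {
    printf "%d-%d : %d\n",
        $low + $k * $step, $low + $k * $step + $size, $count{$k};
}
```

In the main loop, the `$result{$id}{$delta}++` line would then become an increment for each slot returned, and the final printing loop would run over 0..18 instead of 0..9.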