sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am running to this problem explained below. I would appreciate if you could give me your ideas and inputs to solve the following situation. Here is my data structure.

The fields are tab delimited. Ignore the alignment. The last column numbers are aligned in the original file.

Name Pos gc score present_absent CLS_S3_Contig1000 1 0 . . . . . . . . . . . . CLS_S3_Contig1000 67 0 CLS_S3_Contig1000 68 0 CLS_S3_Contig1000 69 0 CLS_S3_Contig1000 70 0 CLS_S3_Contig1000 71 0 CLS_S3_Contig1000 72 0 CLS_S3_Contig1000 73 0 CLS_S3_Contig1000 74 0 CLS_S3_Contig1000 75 0 CLS_S3_Contig1000 76 0 CLS_S3_Contig1000 77 9 0.4324 1 CLS_S3_Contig1000 78 0 CLS_S3_Contig1000 79 8 0.3904 1 CLS_S3_Contig1000 80 0 CLS_S3_Contig1000 81 9 0.452 1 CLS_S3_Contig1000 82 0 CLS_S3_Contig1000 83 11 0.4807 1 CLS_S3_Contig1000 84 0 CLS_S3_Contig1000 85 10 0.4892 1 CLS_S3_Contig1000 86 0 CLS_S3_Contig1000 87 10 0.5234 1 CLS_S3_Contig1000 88 0 CLS_S3_Contig1000 89 11 0.5374 1 CLS_S3_Contig1000 90 0 CLS_S3_Contig1000 91 10 0.5249 1 CLS_S3_Contig1000 92 0 CLS_S3_Contig1000 93 11 0.5553 1 <p> #The name is changing here and it will change over and over <p> CLS_S3_Contig10007 1 0 . . . . . . . . . . . . CLS_S3_Contig10007 43 0 CLS_S3_Contig10007 44 0 CLS_S3_Contig10007 45 0 CLS_S3_Contig10007 46 0 CLS_S3_Contig10007 47 0 CLS_S3_Contig10007 48 0 CLS_S3_Contig10007 49 0 CLS_S3_Contig10007 50 0 CLS_S3_Contig10007 51 0 CLS_S3_Contig10007 52 0 CLS_S3_Contig10007 53 10 0.4316 1 CLS_S3_Contig10007 54 0 CLS_S3_Contig10007 55 12 0.5951 1 CLS_S3_Contig10007 56 0 CLS_S3_Contig10007 57 0 CLS_S3_Contig10007 58 0 CLS_S3_Contig10007 59 12 0.5825 1 CLS_S3_Contig10007 60 0 CLS_S3_Contig10007 61 12 0.4785 1 CLS_S3_Contig10007 62 0 CLS_S3_Contig10007 63 14 0.566 1 CLS_S3_Contig10007 64 0 CLS_S3_Contig10007 65 14 0.4894 1 CLS_S3_Contig10007 66 0 CLS_S3_Contig10007 67 15 0.4756 1 CLS_S3_Contig10007 68 0

This is what I want to do. If present_absent column is ==1 for plus/minus 8 positions put present_absent ==1. For any positions 1-8 if present_absent column==0 skip it, if not put for positions 1-8 present_absent column==1 For max position number if present_absent column is 0 then for max - 8 put present_absebt colum ==1

After these conditions the data should look like this.

Name Pos gc score present_absent CLS_S3_Contig1000 1 0 . . . . . . . . . . . . CLS_S3_Contig1000 67 0 CLS_S3_Contig1000 68 0 CLS_S3_Contig1000 69 1 CLS_S3_Contig1000 70 1 CLS_S3_Contig1000 71 1 CLS_S3_Contig1000 72 1 CLS_S3_Contig1000 73 1 CLS_S3_Contig1000 74 1 CLS_S3_Contig1000 75 1 CLS_S3_Contig1000 76 1 CLS_S3_Contig1000 77 9 0.4324 1 CLS_S3_Contig1000 78 1 CLS_S3_Contig1000 79 8 0.3904 1 CLS_S3_Contig1000 80 1 CLS_S3_Contig1000 81 9 0.452 1 CLS_S3_Contig1000 82 1 CLS_S3_Contig1000 83 11 0.4807 1 CLS_S3_Contig1000 84 1 CLS_S3_Contig1000 85 10 0.4892 1 CLS_S3_Contig1000 86 1 CLS_S3_Contig1000 87 10 0.5234 1 CLS_S3_Contig1000 88 1 CLS_S3_Contig1000 89 11 0.5374 1 CLS_S3_Contig1000 90 1 CLS_S3_Contig1000 91 10 0.5249 1 CLS_S3_Contig1000 92 1 CLS_S3_Contig1000 93 11 0.5553 1 #The name is changing here and it will change over and over CLS_S3_Contig10007 1 0 . . . . . . . . . . . . CLS_S3_Contig10007 42 0 CLS_S3_Contig10007 43 0 CLS_S3_Contig10007 44 0 CLS_S3_Contig10007 45 1 CLS_S3_Contig10007 46 1 CLS_S3_Contig10007 47 1 CLS_S3_Contig10007 48 1 CLS_S3_Contig10007 49 1 CLS_S3_Contig10007 50 1 CLS_S3_Contig10007 51 1 CLS_S3_Contig10007 52 1 CLS_S3_Contig10007 53 10 0.4316 1 CLS_S3_Contig10007 54 1 CLS_S3_Contig10007 55 12 0.5951 1 CLS_S3_Contig10007 56 1 CLS_S3_Contig10007 57 1 CLS_S3_Contig10007 58 1 CLS_S3_Contig10007 59 12 0.5825 1 CLS_S3_Contig10007 60 1 CLS_S3_Contig10007 61 12 0.4785 1 CLS_S3_Contig10007 62 1 CLS_S3_Contig10007 63 14 0.566 1 CLS_S3_Contig10007 64 1 CLS_S3_Contig10007 65 14 0.4894 1 CLS_S3_Contig10007 66 1 CLS_S3_Contig10007 67 15 0.4756 1 CLS_S3_Contig10007 68 1

Replies are listed 'Best First'.
Re: Hash_of_Hash_Would do it?
by cormanaz (Deacon) on Sep 15, 2008 at 20:29 UTC
    Um...so what's the question? You've described the logic, tho I'm not too sure what you mean by "positions." Anyway since you understand it, what's stopping you from just codinf it up?
      Hi Cormanaz, Thanks for the reply. As you can see the third column is called position. Starts from 1 to some number (varies among different names - 1st col).

      I am rephrasing my question. I need to have "1"s where the difference between two adjacent positions is less than 24.

        Maybe you need to simplify the question and what you want done a bit.
Re: Hash_of_Hash_Would do it?
by AZed (Monk) on Sep 15, 2008 at 22:19 UTC

    Let me see if I correctly understand what this file is trying to represent before trying to give any further advice:

    You have, in effect, a hash mapping a name to an array of hashes, where the key to the array is 'pos' (starting at 1, not 0), and the hash inside the array contains 'gc', 'score' and 'present_absent'.

    If you parsed each line via something like:

    my ($name,$pos,$gc,$score,$presentabsent) = parse($current_line);
    you would end up with assignments like:
    my @posarray; my %clsdata; $posarray[$pos] = { 'gc' => $gc, 'score' => $score, 'present_absent' = +> $presentabsent }; $clsdata{$name} = \@posarray;

    Then you want to loop through all of your %clsdata hashes, and for each one, loop through the posarray, and for each member:

    1. If the key is <= 8 and present_absent is nonzero, set present_absent to 1
    2. If the key is >8, check the present_absent values for all keys from key-8 to key+8, and if they are all nonzero, set the present_absent value for that key to 1 (question: what do you do when key > $#posarray-8?)
    3. If the key == $#posarray and the value is 0, then set the value of key-8 to 1, otherwise leave it alone.

    Is that about what you're looking to do? Can you be more specific about exactly which part you're having trouble with?

    Update: Or, grandfather can completely solve your problem while I'm still working out what the question is, heh.

Re: Hash_of_Hash_Would do it?
by Cristoforo (Curate) on Sep 16, 2008 at 00:28 UTC
    I thought I would provide another approach that doesn't use a hash. It processes line by line.

    Chris

    #!/usr/bin/perl use strict; use warnings; chomp(my @previous = split /\t/, <DATA>); my $from = 1; my $to = $previous[2]-9 > 0 ? $previous[2]-9 : 0; # print '0 values' up to 8 less of margin for my $pos ($from..$to) { print join("\t", $previous[1], $pos, '','','0'), "\n"; } $from = $previous[2]-8 > 0 ? $previous[2]-8 : 1; $to = $previous[2]-1; # print margin of '1 values' up to first record for my $pos ($from .. $to) { print join("\t", $previous[1], $pos, '','','1'), "\n"; } # print first record print join("\t", @previous[1..4], '1'), "\n"; my $pos_count = $previous[2]; while (<DATA>) { chomp; my @current= split /\t/; if ($current[1] eq $previous[1]) { if (++$pos_count != $current[2]) { for my $pos ($pos_count .. $current[2]-1) { print join("\t", $current[1], $pos, '','', '1'), "\n"; } } print join("\t", @current[1..4], 1), "\n"; } else { # print '1 values' for margin of 8 past last pos for previous +record for my $pos ($previous[2] + 1 .. $previous[2] + 8) { print join("\t", $previous[1], $pos, '','','1'), "\n"; } $from = 1; $to = $current[2]-9 > 0 ? $current[2]-9 : 0; # print '0 values' up to 8 less of margin for current record for my $pos ($from .. $to) { print join("\t", $current[1], $pos, '','','0'), "\n"; } $from = $current[2]-8 > 0 ? $current[2]-8 : 1; $to = $current[2]-1; # print '1 values' up to current record for my $pos ($from .. $to) { print join("\t", $current[1], $pos, '','','1'), "\n"; } print join("\t", @current[1..4], 1), "\n"; } $pos_count = $current[2]; @previous = @current; } # after last record printed, print out margin of 8 more for my $pos ($previous[2] + 1 .. $previous[2] + 8) { print join("\t", $previous[1], $pos, '','','1'), "\n"; } __DATA__ CLS_S3_Contig100_st CLS_S3_Contig100 53 10 0.3717 CLS_S3_Contig100_at CLS_S3_Contig100 55 11 0.4321 CLS_S3_Contig100_st CLS_S3_Contig100 57 10 0.3223 CLS_S3_Contig100_at CLS_S3_Contig100 59 11 0.4055 CLS_S3_Contig100_st CLS_S3_Contig100 61 11 0.4511 CLS_S3_Contig100_at CLS_S3_Contig100 63 11 0.474 CLS_S3_Contig10031_st CLS_S3_Contig10031 53 12 0.5548 CLS_S3_Contig10031_st CLS_S3_Contig10031 57 10 0.4871 CLS_S3_Contig10031_st CLS_S3_Contig10031 61 12 0.547 CLSS3627.b1_F19.ab1 CLS_S3_Contig10031 62 11 0.5129 CLSS3627.b1_F19.ab1 CLS_S3_Contig10031 64 11 0.5789
    It printed out this solution.

      Hi Chris,

      Thanks for your time and solution. I ran your code and it works great as long as the gap in PIP is not large. In another word if PIP jumps from 240 to 280, it fills all the new col with "1" whereas we want it fill up to 248 from one margin and 272 from the other margin.

      for my $pos ($previous[2] + 1 .. $previous[2] + 8) { print join("\t", $previous[1], $pos, '','','1'), "\n"; }
      is this the place to control the loop. Not pass more than 8?

      Thank you again.

      Pedro

      You would replace
      if (++$pos_count != $current[2]) { for my $pos ($pos_count .. $current[2]-1) { print join("\t", $current[1], $pos, '','', '1'), "\n"; } }

      with

      if (++$pos_count != $current[2]) { fill_interval($pos_count, @current); }

      where fill_interval() is defined as

      sub fill_interval { my ($pos_count, @current) = @_; my $margin = 8; if ($current[2] - $pos_count <= 2*$margin) { for my $pos ($pos_count .. $current[2]-1) { print join("\t", $current[1], $pos, '','', '1'), "\n"; } } else { my @bool; my ($start, $end) = ($pos_count, $current[2]-1); for my $i (0..$margin-1) { @bool[ $start + $i, $end - $i ] = (1,1); } for my $pos ($pos_count .. $current[2]-1) { print join("\t", $current[1], $pos, '','', $bool[$pos] || +0), "\n"; } } }

      Update: Changed literal values to $margin.

        Thank you very much Chris, It is perfect now. A+++
Re: Hash_of_Hash_Would do it?
by GrandFather (Saint) on Sep 15, 2008 at 22:18 UTC

    The following modification of the code I posted in reply to your Complex Data Structure thread should probably do the trick:

    use strict; use warnings; my $data = <<DATA; CLS_S3_Contig100_st,CLS_S3_Contig100,13,10,0.3717 CLS_S3_Contig100_st,CLS_S3_Contig100,53,10,0.3717 CLS_S3_Contig10031_st,CLS_S3_Contig10031,53,12,0.5548 CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871 CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547 CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129 CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789 DATA my %origins; my $numColumns; open my $inFile, '<', \$data; while (<$inFile>) { chomp; next unless length; my @columns = split ','; $numColumns ||= @columns; # Assume first row has correct column co +unt $origins{$columns[1]}[$columns[2] - 1] = \@columns; } close $inFile; for my $oKey (sort keys %origins) { my $origin = $origins{$oKey}; my $lastAdded = -1; # Fake up present flag for +/- 8 lines around real data for my $pip (0 .. $#$origin) { next if $pip <= $lastAdded or ! defined $origin->[$pip]; my $first = $pip > 8 ? $pip - 8 : 0; my $last = $pip + 8; $last = $#$origin if $last > $#$origin; for my $index ($first .. $last) { next if defined $origin->[$index]; @{$origin->[$index]}[0 .. $numColumns - 1] = ('', $oKey, $index + 1, ('') x $numColumns); } $lastAdded = $last; } # Now output the result for my $pip (0 .. $#$origin) { my $row = $origin->[$pip]; if (defined $row) { # pip exists in original file print join (",", @$row, '1'), "\n"; } else { # pip doesn't exist in original file print ",$oKey,", $pip + 1, ',' x ($numColumns - 2), "0\n"; } } }

    Prints:

    ... ,CLS_S3_Contig100,4,,,0 ,CLS_S3_Contig100,5,,,1 ,CLS_S3_Contig100,6,,,1 ,CLS_S3_Contig100,7,,,1 ,CLS_S3_Contig100,8,,,1 ,CLS_S3_Contig100,9,,,1 ,CLS_S3_Contig100,10,,,1 ,CLS_S3_Contig100,11,,,1 ,CLS_S3_Contig100,12,,,1 CLS_S3_Contig100_st,CLS_S3_Contig100,13,10,0.3717,1 ,CLS_S3_Contig100,14,,,1 ,CLS_S3_Contig100,15,,,1 ,CLS_S3_Contig100,16,,,1 ,CLS_S3_Contig100,17,,,1 ,CLS_S3_Contig100,18,,,1 ,CLS_S3_Contig100,19,,,1 ,CLS_S3_Contig100,20,,,1 ,CLS_S3_Contig100,21,,,1 ,CLS_S3_Contig100,22,,,0 ,CLS_S3_Contig100,23,,,0 ... ,CLS_S3_Contig100,43,,,0 ,CLS_S3_Contig100,44,,,0 ,CLS_S3_Contig100,45,,,1 ,CLS_S3_Contig100,46,,,1 ,CLS_S3_Contig100,47,,,1 ,CLS_S3_Contig100,48,,,1 ,CLS_S3_Contig100,49,,,1 ,CLS_S3_Contig100,50,,,1 ,CLS_S3_Contig100,51,,,1 ,CLS_S3_Contig100,52,,,1 CLS_S3_Contig100_st,CLS_S3_Contig100,53,10,0.3717,1 ,CLS_S3_Contig10031,1,,,0 ... ,CLS_S3_Contig10031,44,,,0 ,CLS_S3_Contig10031,45,,,1 ,CLS_S3_Contig10031,46,,,1 ,CLS_S3_Contig10031,47,,,1 ,CLS_S3_Contig10031,48,,,1 ,CLS_S3_Contig10031,49,,,1 ,CLS_S3_Contig10031,50,,,1 ,CLS_S3_Contig10031,51,,,1 ,CLS_S3_Contig10031,52,,,1 CLS_S3_Contig10031_st,CLS_S3_Contig10031,53,12,0.5548,1 ,CLS_S3_Contig10031,54,,,1 ,CLS_S3_Contig10031,55,,,1 ,CLS_S3_Contig10031,56,,,1 CLS_S3_Contig10031_st,CLS_S3_Contig10031,57,10,0.4871,1 ,CLS_S3_Contig10031,58,,,1 ,CLS_S3_Contig10031,59,,,1 ,CLS_S3_Contig10031,60,,,1 CLS_S3_Contig10031_st,CLS_S3_Contig10031,61,12,0.547,1 CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,62,11,0.5129,1 ,CLS_S3_Contig10031,63,,,1 CLSS3627.b1_F19.ab1,CLS_S3_Contig10031,64,11,0.5789,1

    Perl reduces RSI - it saves typing
      Thank your very much GrandFather,

      The code works perfect, but when the gap between two adjacent PIP in the original file is large let's say 16 or more it behaves differently.

      Please take a look at the input and output files I have included here. In the output after 949 it has printed three "1"s instead of 8.

      I went through the code and I think a condition should be added here. To see if the gap in pip is greater or equal to 16. is my speculation right?

      my $first = $pip > 8 ? $pip - 8 : 0; my $last = $pip + 8;
      Original CLS_S3_Contig1000 933 10 0.488 CLS_S3_Contig1000 935 9 0.4867 CLS_S3_Contig1000 937 9 0.4989 CLS_S3_Contig1000 939 10 0.503 CLS_S3_Contig1000 943 8 0.3531 CLS_S3_Contig1000 949 8 0.3477 CLS_S3_Contig1000 1060 6 0.1498 CLS_S3_Contig1000 1070 9 0.3885 CLS_S3_Contig1000 1071 8 0.3848 OUTPUT CLS_S3_Contig1000 933 10 0.488 1 CLS_S3_Contig1000 934 1 CLS_S3_Contig1000 935 9 0.4867 1 CLS_S3_Contig1000 936 1 CLS_S3_Contig1000 937 9 0.4989 1 CLS_S3_Contig1000 938 1 CLS_S3_Contig1000 939 10 0.503 1 CLS_S3_Contig1000 940 1 CLS_S3_Contig1000 941 1 CLS_S3_Contig1000 942 1 CLS_S3_Contig1000 943 8 0.3531 1 CLS_S3_Contig1000 944 1 CLS_S3_Contig1000 945 1 CLS_S3_Contig1000 946 1 CLS_S3_Contig1000 947 1 CLS_S3_Contig1000 948 1 CLS_S3_Contig1000 949 8 0.3477 1 CLS_S3_Contig1000 950 1 CLS_S3_Contig1000 951 1 CLS_S3_Contig1000 952 0 CLS_S3_Contig1000 953 0 CLS_S3_Contig1000 954 0 CLS_S3_Contig1000 955 0 CLS_S3_Contig1000 956 0 CLS_S3_Contig1000 957 0

        It needs a little more work than that. Replace the fake up loop and preceding line with:

        my $lastToAdd = -1; # Fake up present flag for +/- 8 lines around real data for my $pip (0 .. $#$origin) { next if $pip > $lastToAdd and ! defined $origin->[$pip]; if ($pip <= $lastToAdd and ! defined $origin->[$pip]) { @{$origin->[$pip]}[0 .. $numColumns - 1] = ('', $oKey, $pip + 1, ('') x $numColumns); next; } my $first = $pip > 8 ? $pip - 8 : 0; for my $index ($first .. $pip) { next if defined $origin->[$index]; @{$origin->[$index]}[0 .. $numColumns - 1] = ('', $oKey, $index + 1, ('') x $numColumns); } $lastToAdd = $pip + 8; }

        The problem of course is that the finish point needs to be reset each time a new real line is found.


        Perl reduces RSI - it saves typing