mbp has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I'm struggling with a (seemingly?) simple bit of code. I have the below type of input file:
C10000035 12 C 4 ....^>. HHFCC
C10000035 13 C 6 .....^>. HHFFCC
C10000035 14 C 6 ...... JHFFCC
C10000035 15 C 6 ...... IHFFFC
C10000035 16 A 4 .GG...^>G JGHFFFC
C10000035 17 C 7 ....... JGHFFFC
C10000035 18 C 8 .......^]. JIHHFFC@
C10000035 19 A 8 ........ IJHHFFFC
C10000035 20 C 9 ..T...T.^]. JIHGHFF@C
C10000035 21 G 10 A........^]. AJJHHHFDCC
C10000040 30 C 5 ....^>. HHFCC
C10000040 31 C 6 .....^>. HHFFCC
C10000040 32 C 6 ...... JHFFCC
C10000040 33 C 6 ...... IHFFFC
C10000040 34 C 4 ...... IHFFFC
C10000040 35 C 4 ...... IHFFFC
C10000040 36 C 4 ...... IHFFFC
C10000040 37 C 6 ...... IHFFFC
C10000040 38 C 6 ...... IHFFFC
The first, second and fourth columns have essential information that I am trying to work with.
Note that the lines are sorted by first column first, then by second column (as shown).
What I am trying to do is the following:
1) Identify series of lines with a match in the first column, consecutive numbers in the second column, and fourth column values consistently greater than four.
2) Print out this information as: first column value, start position (the first second-column value in the series), end position (the last second-column value in the series), and length of the series (end position minus start position, plus one).
For example, the above input data would ideally result in the following output:
C10000035,13,15,3
C10000035,17,21,5
C10000040,30,33,4
C10000040,37,38,2
My initial thought was to read the lines of the file until a line was encountered with a value in the fourth column greater than 4. At this point, the values in columns one and two would be pushed into arrays and a count value would be incremented, and this would continue until either a new first-column value was encountered, a non-consecutive value was encountered in column two, or a value less than five was encountered in column four. At this point, the process would start over until all of the lines in the file were 'processed'.
Here is the code I have so far:
#!/usr/bin/perl
use strict;
use warnings;

#usage: perl script.pl <input_file>
my $pileup =$ARGV[0]; # get filename from command line argument
open (IN, $pileup) or die ("Could not open file.\n"); #test for file
my @chroms; #initialize chroms array
my @positions; #initialize positions array
my $count = 0; #initialize count
while ( my $line = <IN>){ #while line read from input file
    #split line into array from tab-delimited fields
    my @line = split("\t",$line);
    #check if element [3] (coverage depth) less than 5
    if($line[3]<5){
        #if above condition met, move to next line
        next;
    }
    else{
        do{
            #if above conditions not met, push element [0]
            #(chromosome name) into chroms array
            push @chroms, $line[0];
            #push element [1] (position on chromosome) into
            #positions array
            push @positions, $line[1];
            #increment count by one
            $count++;
        }
        #do the above until element added to chrom array does
        #not match previous elements OR
        #until position is not in sequence OR
        #element [3] of line (coverage depth) is less than 5
        until ($chroms[0] ne $chroms[$count-1] ||
               $positions[0] != $positions[$count-1]-length(@positions) ||
               $line[3]<5);
        next;
    }
}
#print initial chromosome name, first position, last position,
#length of span of consecutive positions
print $chroms[0],"\t", $positions[0],"\t", $positions[$count-2],"\t", $positions[$count-2]-$positions[0]+1,"\n";
And here is the output I get:
C10000035 13 37 25
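One detail in the code above that may contribute to this result: length() is a string function, so length(@positions) evaluates the array in scalar context and returns the number of characters in the element count, not the number of elements. The short sketch below illustrates the difference; the variable names are only for illustration and are not part of the script above.

```perl
use strict;
use warnings;

my @positions = (13, 14, 15);

# scalar(@positions) is the number of elements in the array.
my $count = scalar(@positions);

# length() measures a string; applied to the count it gives
# the number of digits in that count.
my $digits = length($count);

print "elements: $count, length of count: $digits\n";
# prints "elements: 3, length of count: 1"
```

Note also that the do/until block in the script never reads a new line, so $line[0], $line[1] and $line[3] keep the same values on every pass through that inner loop.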
I haven't had much luck searching for an answer, and I am a bit new to control structures and line-by-line processing in this way in perl. I have also tried shifting the print statement and the array and count initializations around in their scope, but with different and still unsuccessful results.
I would greatly appreciate any advice the powerful minds of the perl monks could offer - am I on the right track, is there something simple I am missing? Thank you very much!
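For reference, here is a minimal single-pass sketch of the approach described above. It is one possible reading of the requirements, not the only way to do it. The names find_runs and $flush are hypothetical, the fields are split on any whitespace rather than strictly on tabs, and a lone qualifying line is reported as a series of length 1; all three are assumptions. The idea is to keep only the current series (name, start, end) and to print it whenever a low-depth line, a gap in positions, or a new first-column value ends it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one "name,start,end,length" string per
# series of consecutive positions whose depth (4th column) is at least
# $min_depth.
sub find_runs {
    my ($min_depth, @lines) = @_;
    my @runs;
    my ($name, $start, $end);    # the series currently being extended, if any

    # Record the open series (if there is one) and close it.
    my $flush = sub {
        push @runs, join(',', $name, $start, $end, $end - $start + 1)
            if defined $name;
        undef $name;
    };

    for my $line (@lines) {
        # Assumption: splitting on whitespace covers tabs and spaces alike.
        my ($chrom, $pos, undef, $depth) = split ' ', $line;
        next unless defined $depth;      # skip blank or short lines

        if ($depth < $min_depth) {       # low depth ends any open series
            $flush->();
            next;
        }
        if (defined $name && $chrom eq $name && $pos == $end + 1) {
            $end = $pos;                 # extends the open series
        }
        else {
            $flush->();                  # new name or gap: start a new series
            ($name, $start, $end) = ($chrom, $pos, $pos);
        }
    }
    $flush->();                          # the final series has no line after it
    return @runs;
}

# First four columns of the sample input from the post.
my @sample = split /\n/, <<'END';
C10000035 12 C 4
C10000035 13 C 6
C10000035 14 C 6
C10000035 15 C 6
C10000035 16 A 4
C10000035 17 C 7
C10000035 18 C 8
C10000035 19 A 8
C10000035 20 C 9
C10000035 21 G 10
C10000040 30 C 5
C10000040 31 C 6
C10000040 32 C 6
C10000040 33 C 6
C10000040 34 C 4
C10000040 35 C 4
C10000040 36 C 4
C10000040 37 C 6
C10000040 38 C 6
END

print "$_\n" for find_runs(5, @sample);
```

On the sample data this prints the four lines listed as the desired output. To run it on a file, the lines read from the filehandle could be passed to find_runs in place of @sample.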
Re: Looking for series in consecutive lines of a file
by toolic (Bishop) on Feb 17, 2015 at 02:00 UTC
by mbp (Novice) on Feb 17, 2015 at 05:27 UTC
by Athanasius (Archbishop) on Feb 17, 2015 at 06:52 UTC
by Anonymous Monk on Feb 17, 2015 at 08:04 UTC
by mbp (Novice) on Feb 17, 2015 at 23:16 UTC
by mbp (Novice) on Feb 17, 2015 at 23:12 UTC