mbp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm struggling with a (seemingly?) simple bit of code. I have the below type of input file:

C10000035 12 C 4 ....^>. HHFCC C10000035 13 C 6 .....^>. HHFFCC C10000035 14 C 6 ...... JHFFCC C10000035 15 C 6 ...... IHFFFC C10000035 16 A 4 .GG...^>G JGHFFFC C10000035 17 C 7 ....... JGHFFFC C10000035 18 C 8 .......^]. JIHHFFC@ C10000035 19 A 8 ........ IJHHFFFC C10000035 20 C 9 ..T...T.^]. JIHGHFF@C C10000035 21 G 10 A........^]. AJJHHHFDCC C10000040 30 C 5 ....^>. HHFCC C10000040 31 C 6 .....^>. HHFFCC C10000040 32 C 6 ...... JHFFCC C10000040 33 C 6 ...... IHFFFC C10000040 34 C 4 ...... IHFFFC C10000040 35 C 4 ...... IHFFFC C10000040 36 C 4 ...... IHFFFC C10000040 37 C 6 ...... IHFFFC C10000040 38 C 6 ...... IHFFFC

The first, second and fourth columns have essential information that I am trying to work with.

Note that the lines are sorted by first column first, then by second column (as shown).

What I am trying to do is the following:

1) Identify series of lines with a match in the first column, consecutive numbers in the second column, and fourth column values consistently greater than four.

2) Print out this information as: first column value, starting position (first second-column value in series), end position (last second-column value in series), length of series (difference between last two values +1)

For example, the above input data would ideally result in the following output:

C10000035,13,15,3 C10000035,17,21,5 C10000040,30,33,4 C10000040,37,38,2

My initial thought was to read the lines of the file until a line was encountered with a value in the fourth column greater than 4. At this point, the values in columns one and two would be pushed into arrays and a count value would be incremented, and this would continue until either a new first-column value was encountered, a non-consecutive value was encountered in column two, or a value less than five was encountered in column four. At this point, the process would start over until all of the lines in the file were 'processed'.

Here is the code I have so far:

#!/usr/bin/perl use strict; use warnings; #usage: perl script.pl <input_file> my $pileup =$ARGV[0]; # get filename from command line argument open (IN, $pileup) or die ("Could not open file.\n"); #test for file my @chroms; #initialize chroms array my @positions; #initialize positions array my $count = 0; #initialize count while ( my $line = <IN>){ #while line read from input file #split line into array from tab-delimited fields my @line = split("\t",$line); #check if element [3] (coverage depth) less than 5 if($line[3]<5){ #if above condition met, move to next line next; } else{ do{ #if above conditions not met, push element [0] #(chromosome name) into chroms array push @chroms, $line[0]; #push element [1] (position on chromosome) into #positions array push @positions, $line[1]; #increment count by one $count++; } #do the above until element added to chrom array does #not match previous elements OR #until position is not in sequence OR #element [3] of line (coverage depth) is less than 5 until ($chroms[0] ne $chroms[$count-1] || $positions[0] != $positions[$count-1]-length(@positions) || $line[3]<5); next; } } #print initial chromosome name, first position, last position, #length of span of consecutive positions print $chroms[0],"\t", $positions[0],"\t", $positions[$count-2],"\t", $positions[$count-2]-$positions[0]+1,"\n";

And here is the output I get:

C10000035 13 37 25

I haven't had much luck searching for an answer, and I am a bit new to control structures and line-by-line processing in this way in perl. I have also tried shifting the print statement and the array and count initializations around in their scope, but with different and still unsuccessful results.

I would greatly appreciate any advice the powerful minds of the perl monks could offer - am I on the right track, is there something simple I am missing? Thank you very much!

Replies are listed 'Best First'.
Re: Looking for series in consecutive lines of a file
by toolic (Bishop) on Feb 17, 2015 at 02:00 UTC
    This produces the desired output:
    use warnings; use strict; my %data; while (<DATA>) { my @cols = split; push @{ $data{$cols[0]} }, $cols[1] if $cols[3] > 4; } for my $k (sort keys %data) { my $i = 1; my @ns = @{ $data{$k} }; my $n0 = shift @ns; my @ns2 = $n0; for my $n (@ns) { if ($n == $n0+1) { push @ns2, $n; $i++; } else { print "$k,$ns2[0],$ns2[-1],$i\n"; @ns2 = $n; $i = 1; } $n0 = $n; } print "$k,$ns2[0],$ns2[-1],$i\n"; } __DATA__ C10000035 12 C 4 ....^>. HHFCC C10000035 13 C 6 .....^>. HHFFCC C10000035 14 C 6 ...... JHFFCC C10000035 15 C 6 ...... IHFFFC C10000035 16 A 4 .GG...^>G JGHFFFC C10000035 17 C 7 ....... JGHFFFC C10000035 18 C 8 .......^]. JIHHFFC@ C10000035 19 A 8 ........ IJHHFFFC C10000035 20 C 9 ..T...T.^]. JIHGHFF@C C10000035 21 G 10 A........^]. AJJHHHFDCC C10000040 30 C 5 ....^>. HHFCC C10000040 31 C 6 .....^>. HHFFCC C10000040 32 C 6 ...... JHFFCC C10000040 33 C 6 ...... IHFFFC C10000040 34 C 4 ...... IHFFFC C10000040 35 C 4 ...... IHFFFC C10000040 36 C 4 ...... IHFFFC C10000040 37 C 6 ...... IHFFFC C10000040 38 C 6 ...... IHFFFC

    See also:

      Hi toolic, thanks very much for that, I have it working now on my end also so that it accepts the data from an input file.

      As I understand it, it looks like your script reads in all of the data first (into the hash 'data'), prior to running the analysis. Is this correct? If so, is there a way rather to do the analysis piecemeal, in other words line by line? The reason I ask is that the files I have to deal with are quite large (up to 80G or so), so it may not be efficient or possible to store them in memory as opposed to parsing them line by line.

      I apologize, I should have made this clear in my original post - but that is why I had my original strategy of reading line by line and storing the first 'qualifying' line of a set and comparing it to the subsequent lines, and then starting the process over for each new set of consecutive lines.

      Thanks again, and thanks in advance if you are able to give advice on a line-by-line version. Cheers. MBP

        Hello mbp,

        Since the input lines are known to be sorted, it is feasible to reduce memory requirements by reading the input file line-by-line. Here is one approach:

        #! perl use strict; use warnings; use constant MIN_DEPTH => 5; my ($chromosome, $position, undef, $coverage_depth) = split /\s+/, <D +ATA>; my %series = ( name => $chromosome, start => $position, end => $position, depth => $coverage_depth, ); while (<DATA>) { ($chromosome, $position, undef, $coverage_depth) = split /\s+/; if ($series{name} eq $chromosome && $series{end} == $position - 1 && $series{depth} >= MIN_DEPTH && $coverage_depth >= MIN_DEPTH) { $series{end} = $position; } else { display_series(); %series = ( name => $chromosome, start => $position, end => $position, depth => $coverage_depth, ); } } display_series(); sub display_series { if ($series{depth} >= MIN_DEPTH) { print join(',', $series{name}, $series{start}, $series{end}, $series{end} - $series{start} + 1), "\n"; } } __DATA__ C10000035 12 C 4 ....^>. HHFCC C10000035 13 C 6 .....^>. HHFFCC C10000035 14 C 6 ...... JHFFCC C10000035 15 C 6 ...... IHFFFC C10000035 16 A 4 .GG...^>G JGHFFFC C10000035 17 C 7 ....... JGHFFFC C10000035 18 C 8 .......^]. JIHHFFC@ C10000035 19 A 8 ........ IJHHFFFC C10000035 20 C 9 ..T...T.^]. JIHGHFF@C C10000035 21 G 10 A........^]. AJJHHHFDCC C10000040 30 C 5 ....^>. HHFCC C10000040 31 C 6 .....^>. HHFFCC C10000040 32 C 6 ...... JHFFCC C10000040 33 C 6 ...... IHFFFC C10000040 34 C 4 ...... IHFFFC C10000040 35 C 4 ...... IHFFFC C10000040 36 C 4 ...... IHFFFC C10000040 37 C 6 ...... IHFFFC C10000040 38 C 6 ...... IHFFFC

        Output:

        16:48 >perl 1157_SoPW.pl C10000035,13,15,3 C10000035,17,21,5 C10000040,30,33,4 C10000040,37,38,2 16:50 >

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,