Looking for series in consecutive lines of a file

mbp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm struggling with a (seemingly?) simple bit of code. I have the below type of input file:

C10000035    12    C    4    ....^>.    HHFCC
C10000035    13    C    6    .....^>.    HHFFCC
C10000035    14    C    6    ......    JHFFCC
C10000035    15    C    6    ......    IHFFFC
C10000035    16    A    4    .GG...^>G    JGHFFFC
C10000035    17    C    7    .......    JGHFFFC
C10000035    18    C    8    .......^].    JIHHFFC@
C10000035    19    A    8    ........    IJHHFFFC
C10000035    20    C    9    ..T...T.^].    JIHGHFF@C
C10000035    21    G    10    A........^].    AJJHHHFDCC
C10000040    30    C    5    ....^>.    HHFCC
C10000040    31    C    6    .....^>.    HHFFCC
C10000040    32    C    6    ......    JHFFCC
C10000040    33    C    6    ......    IHFFFC
C10000040    34    C    4    ......    IHFFFC
C10000040    35    C    4    ......    IHFFFC
C10000040    36    C    4    ......    IHFFFC
C10000040    37    C    6    ......    IHFFFC
C10000040    38    C    6    ......    IHFFFC
[download]

The first, second and fourth columns have essential information that I am trying to work with.

Note that the lines are sorted by first column first, then by second column (as shown).

What I am trying to do is the following:

1) Identify series of lines with a match in the first column, consecutive numbers in the second column, and fourth column values consistently greater than four.

2) Print out this information as: first column value, starting position (first second-column value in series), end position (last second-column value in series), length of series (difference between last two values +1)

For example, the above input data would ideally result in the following output:


C10000035,13,15,3
C10000035,17,21,5
C10000040,30,33,4
C10000040,37,38,2
[download]

My initial thought was to read the lines of the file until a line was encountered with a value in the fourth column greater than 4. At this point, the values in columns one and two would be pushed into arrays and a count value would be incremented, and this would continue until either a new first-column value was encountered, a non-consecutive value was encountered in column two, or a value less than five was encountered in column four. At this point, the process would start over until all of the lines in the file were 'processed'.

Here is the code I have so far:


#!/usr/bin/perl


use strict;
use warnings;

#usage: perl script.pl <input_file>


my $pileup =$ARGV[0];   # get filename from command line argument


open (IN, $pileup) or die ("Could not open file.\n");   #test for file

my @chroms;         #initialize chroms array
my @positions;      #initialize positions array
my $count = 0;      #initialize count


while ( my $line = <IN>){          #while line read from input file
    
    #split line into array from tab-delimited fields
    my @line = split("\t",$line);
    
    #check if element [3] (coverage depth) less than 5
    if($line[3]<5){
        
        #if above condition met, move to next line
        next;
    }
    else{

        do{
            #if above conditions not met, push element [0] 
            #(chromosome name) into chroms array
            push @chroms, $line[0];
            
            #push element [1] (position on chromosome) into 
            #positions array
            push @positions, $line[1];
            
            #increment count by one
            $count++;
        
        }

        
        #do the above until element added to chrom array does 
        #not match previous elements OR
        #until position is not in sequence OR
        #element [3] of line (coverage depth) is less than 5
        
        until ($chroms[0] ne $chroms[$count-1] || 
        $positions[0] != $positions[$count-1]-length(@positions) ||
        $line[3]<5);
        next;
        

        }
    }

#print initial chromosome name, first position, last position, 
#length of span of consecutive positions

print   $chroms[0],"\t",
        $positions[0],"\t",
        $positions[$count-2],"\t",
        $positions[$count-2]-$positions[0]+1,"\n";
[download]

And here is the output I get:


C10000035    13    37    25
[download]

I haven't had much luck searching for an answer, and I am a bit new to control structures and line-by-line processing in this way in perl. I have also tried shifting the print statement and the array and count initializations around in their scope, but with different and still unsuccessful results.

I would greatly appreciate any advice the powerful minds of the perl monks could offer - am I on the right track, is there something simple I am missing? Thank you very much!

Comment on Looking for series in consecutive lines of a file Select or Download Code

Replies are listed 'Best First'.

Re: Looking for series in consecutive lines of a file
by toolic (Bishop) on Feb 17, 2015 at 02:00 UTC

use warnings;
use strict;

my %data;
while (<DATA>) {
    my @cols = split;
    push @{ $data{$cols[0]} }, $cols[1] if $cols[3] > 4;
}

for my $k (sort keys %data) {
    my $i = 1;
    my @ns = @{ $data{$k} };
    my $n0 = shift @ns;
    my @ns2 = $n0;
    for my $n (@ns) {
        if ($n == $n0+1) {
            push @ns2, $n;
            $i++;
        }
        else {
            print "$k,$ns2[0],$ns2[-1],$i\n";
            @ns2 = $n;
            $i = 1;
        }
        $n0 = $n;
    }
    print "$k,$ns2[0],$ns2[-1],$i\n";
}

__DATA__
C10000035    12    C    4    ....^>.    HHFCC
C10000035    13    C    6    .....^>.    HHFFCC
C10000035    14    C    6    ......    JHFFCC
C10000035    15    C    6    ......    IHFFFC
C10000035    16    A    4    .GG...^>G    JGHFFFC
C10000035    17    C    7    .......    JGHFFFC
C10000035    18    C    8    .......^].    JIHHFFC@
C10000035    19    A    8    ........    IJHHFFFC
C10000035    20    C    9    ..T...T.^].    JIHGHFF@C
C10000035    21    G    10    A........^].    AJJHHHFDCC
C10000040    30    C    5    ....^>.    HHFCC
C10000040    31    C    6    .....^>.    HHFFCC
C10000040    32    C    6    ......    JHFFCC
C10000040    33    C    6    ......    IHFFFC
C10000040    34    C    4    ......    IHFFFC
C10000040    35    C    4    ......    IHFFFC
C10000040    36    C    4    ......    IHFFFC
C10000040    37    C    6    ......    IHFFFC
C10000040    38    C    6    ......    IHFFFC
[download]

See also:

perldsc

[reply]
[d/l]

Re^2: Looking for series in consecutive lines of a file

by mbp (Novice) on Feb 17, 2015 at 05:27 UTC

Hi toolic, thanks very much for that, I have it working now on my end also so that it accepts the data from an input file.

As I understand it, it looks like your script reads in all of the data first (into the hash 'data'), prior to running the analysis. Is this correct? If so, is there a way rather to do the analysis piecemeal, in other words line by line? The reason I ask is that the files I have to deal with are quite large (up to 80G or so), so it may not be efficient or possible to store them in memory as opposed to parsing them line by line.

I apologize, I should have made this clear in my original post - but that is why I had my original strategy of reading line by line and storing the first 'qualifying' line of a set and comparing it to the subsequent lines, and then starting the process over for each new set of consecutive lines.

Thanks again, and thanks in advance if you are able to give advice on a line-by-line version. Cheers. MBP

[reply]

Re^3: Looking for series in consecutive lines of a file

by Athanasius (Archbishop) on Feb 17, 2015 at 06:52 UTC

Hello mbp,

Since the input lines are known to be sorted, it is feasible to reduce memory requirements by reading the input file line-by-line. Here is one approach:

#! perl
use strict;
use warnings;
use constant MIN_DEPTH => 5;

my  ($chromosome, $position, undef, $coverage_depth) = split /\s+/, <D
+ATA>;
my   %series =
(
    name  => $chromosome,
    start => $position,
    end   => $position,
    depth => $coverage_depth,
);

while (<DATA>)
{
    ($chromosome, $position, undef, $coverage_depth) = split /\s+/;

    if ($series{name}   eq $chromosome   &&
        $series{end}    == $position - 1 &&
        $series{depth}  >= MIN_DEPTH     &&
        $coverage_depth >= MIN_DEPTH)
    {
        $series{end}     = $position;
    }
    else
    {
        display_series();

        %series =
        (
            name  => $chromosome,
            start => $position,
            end   => $position,
            depth => $coverage_depth,
        );
    }
}

display_series();

sub display_series
{
    if ($series{depth} >= MIN_DEPTH)
    {
        print join(',', $series{name}, $series{start}, $series{end},
                        $series{end} - $series{start} + 1), "\n";
    }
}

__DATA__
C10000035    12    C    4    ....^>.    HHFCC
C10000035    13    C    6    .....^>.    HHFFCC
C10000035    14    C    6    ......    JHFFCC
C10000035    15    C    6    ......    IHFFFC
C10000035    16    A    4    .GG...^>G    JGHFFFC
C10000035    17    C    7    .......    JGHFFFC
C10000035    18    C    8    .......^].    JIHHFFC@
C10000035    19    A    8    ........    IJHHFFFC
C10000035    20    C    9    ..T...T.^].    JIHGHFF@C
C10000035    21    G    10    A........^].    AJJHHHFDCC
C10000040    30    C    5    ....^>.    HHFCC
C10000040    31    C    6    .....^>.    HHFFCC
C10000040    32    C    6    ......    JHFFCC
C10000040    33    C    6    ......    IHFFFC
C10000040    34    C    4    ......    IHFFFC
C10000040    35    C    4    ......    IHFFFC
C10000040    36    C    4    ......    IHFFFC
C10000040    37    C    6    ......    IHFFFC
C10000040    38    C    6    ......    IHFFFC
[download]

Output:

16:48 >perl 1157_SoPW.pl
C10000035,13,15,3
C10000035,17,21,5
C10000040,30,33,4
C10000040,37,38,2

16:50 >
[download]

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^4: Looking for series in consecutive lines of a file

by Anonymous Monk on Feb 17, 2015 at 08:04 UTC

Re^5: Looking for series in consecutive lines of a file

by mbp (Novice) on Feb 17, 2015 at 23:16 UTC

Re^4: Looking for series in consecutive lines of a file

by mbp (Novice) on Feb 17, 2015 at 23:12 UTC