in reply to Re^2: Looking for series in consecutive lines of a file
in thread Looking for series in consecutive lines of a file

Hello mbp,

Since the input lines are known to be sorted, it is feasible to reduce memory requirements by reading the input file line-by-line. Here is one approach:

#! perl
use strict;
use warnings;

use constant MIN_DEPTH => 5;

# Prime the first series from the first input line
my ($chromosome, $position, undef, $coverage_depth) = split /\s+/, <DATA>;

my %series = (
    name  => $chromosome,
    start => $position,
    end   => $position,
    depth => $coverage_depth,
);

while (<DATA>)
{
    ($chromosome, $position, undef, $coverage_depth) = split /\s+/;

    # Extend the current series only if this line is on the same chromosome,
    # at the next position, and both depths meet the minimum
    if ($series{name}  eq $chromosome    &&
        $series{end}   == $position - 1  &&
        $series{depth} >= MIN_DEPTH      &&
        $coverage_depth >= MIN_DEPTH)
    {
        $series{end} = $position;
    }
    else
    {
        # Otherwise report the finished series and start a new one here
        display_series();

        %series = (
            name  => $chromosome,
            start => $position,
            end   => $position,
            depth => $coverage_depth,
        );
    }
}

display_series();    # report the final series

# Print name, start, end, and length, but only for series deep enough
sub display_series
{
    if ($series{depth} >= MIN_DEPTH)
    {
        print join(',', $series{name}, $series{start}, $series{end},
                        $series{end} - $series{start} + 1), "\n";
    }
}

__DATA__
C10000035 12 C 4 ....^>. HHFCC
C10000035 13 C 6 .....^>. HHFFCC
C10000035 14 C 6 ...... JHFFCC
C10000035 15 C 6 ...... IHFFFC
C10000035 16 A 4 .GG...^>G JGHFFFC
C10000035 17 C 7 ....... JGHFFFC
C10000035 18 C 8 .......^]. JIHHFFC@
C10000035 19 A 8 ........ IJHHFFFC
C10000035 20 C 9 ..T...T.^]. JIHGHFF@C
C10000035 21 G 10 A........^]. AJJHHHFDCC
C10000040 30 C 5 ....^>. HHFCC
C10000040 31 C 6 .....^>. HHFFCC
C10000040 32 C 6 ...... JHFFCC
C10000040 33 C 6 ...... IHFFFC
C10000040 34 C 4 ...... IHFFFC
C10000040 35 C 4 ...... IHFFFC
C10000040 36 C 4 ...... IHFFFC
C10000040 37 C 6 ...... IHFFFC
C10000040 38 C 6 ...... IHFFFC

Output:

16:48 >perl 1157_SoPW.pl
C10000035,13,15,3
C10000035,17,21,5
C10000040,30,33,4
C10000040,37,38,2

16:50 >
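
The script above reads from its own __DATA__ section. For a real file, the same line-by-line idea might be sketched as below. This is only a sketch under assumptions: the sub name find_series, the script name, and taking the input file from the command line are all hypothetical, and the logic is a slight rearrangement of the loop above (a series is only ever opened on a line that meets the minimum depth), not the identical code.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: scan a sorted pileup-style filehandle line by line and
# return every run of consecutive positions whose depth is >= $min_depth.
sub find_series {
    my ($fh, $min_depth) = @_;
    my (@found, $run);                       # $run is the series being built

    while (my $line = <$fh>) {
        my ($name, $pos, undef, $depth) = split ' ', $line;
        next unless defined $depth;          # skip blank or short lines

        if ($depth >= $min_depth) {
            if ($run && $run->{name} eq $name && $run->{end} == $pos - 1) {
                $run->{end} = $pos;          # extend the current series
                next;
            }
            push @found, $run if $run;       # close the previous series
            $run = { name => $name, start => $pos, end => $pos };
        }
        elsif ($run) {
            push @found, $run;               # a shallow line ends the series
            undef $run;
        }
    }
    push @found, $run if $run;               # flush the last series
    return @found;
}

# Usage (file name is an assumption): perl find_series.pl input.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open '$ARGV[0]': $!";
    for my $s (find_series($fh, 5)) {
        print join(',', @$s{qw(name start end)},
                        $s->{end} - $s->{start} + 1), "\n";
    }
    close $fh;
}
```

Only one series is held in memory at a time while scanning, although the sub does collect the finished series in a list before returning. If even that list would be too large, printing each series as it closes, as the original script does, would avoid it.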

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^4: Looking for series in consecutive lines of a file
by Anonymous Monk on Feb 17, 2015 at 08:04 UTC

    I see you're trying not to overwhelm the new guy, but introducing "use constant" and not introducing subroutine arguments, references? Eeew :)

    ...
    display_series( \%series );
    ...
    display_series( \%series );
    ...
    sub display_series {
        my( $series ) = @_;
        if ($series->{depth} >= MIN_DEPTH) {
            print join(',', $series->{name}, $series->{start}, $series->{end},
                            $series->{end} - $series->{start} + 1), "\n";
        }
    }

      Anonymous Monk, haha, yes I do have a ways to go, but thanks for your addition!

Re^4: Looking for series in consecutive lines of a file
by mbp (Novice) on Feb 17, 2015 at 23:12 UTC

    Hi Athanasius,

    Brilliant, that works a treat! Thank you very much for your time and help, I really appreciate it.

    Best,

    mbp