comment on

Dear Monks, I have written a script to calculate the frequency(=coverage) of start sites on DNA strands. The start position is located in column 2 when strand is (+), and located in column 3 when strand is (-). And coverage/frequency is stored in two different hashes: one for each strand.

 
#!/usr/bin/perl
use 5.28.0;
my $lineCounter = 1;
my %startPoints;
my %startPoints_plusStrand;
my %startPoints_minusStrand;
my ($chr, $start, $end, $strand);

open (my $OUT_FH,  ">", "results.bed" ) or die "Cannot open file.\n $!
+";

# Count frequency of start sited in the column-2.
while(<DATA>){
    next if /^#/; #ignore comments in the input file 
    chomp (my $line = $_);
    ($chr, $start, $end, $strand) = (split /\t/, $line)[0,1,2,5];
    my $startPoint;
    if($strand eq '+'){
            $startPoint = $start;
            $startPoints_plusStrand{$startPoint}++;
    } elsif ($strand eq '-') {
            $startPoint = $end;
            $startPoints_minusStrand{$startPoint}++;
    } else {
            die "Goofy BED Data on line $lineCounter:\n$_";
    }
}

say $OUT_FH "#Plus Strand BedGraph:";
for my $key (sort {$a <=> $b} keys %startPoints_plusStrand) {
        say $OUT_FH $key+1, "\t","+","$startPoints_plusStrand{$key}";
}
__DATA__
#chromosome    start    end    position    col-5    strand
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127472363    127473530    Pos2    0    +
chr7    127473530    127474697    Pos3    0    -
chr7    127474697    127475864    Pos4    0    -
#chr8    40000    40500    Pos5    0    +
#chr8    40000    40600    Pos5    0    +
#chr8    40000    40650    Pos5    0    +
#chr8    41000    41200    Pos6    0    -
[download]

If I were doing this for single chromosome, it would work fine (chromosome 7 in my DATA). However, I want to do this for a .BED file containing all 23 human chromosomes, chr1, chr2, chr3 ...... What I want to do is that to calculate coverages for each chromosome and write the result into a single output file or multiple output files like chr1-covrage.txt, chr2-covrage.txt.

My input file is larger than 1 GB. And, example data could be expanded to multiple chrosomes by uncommenting the last 4 lines in the __DATA__ section, which gives chr and chr8. Like:

chr7    127471196    127472363    Pos1    0    +
chr8    40000    40500    Pos5    0    +
[download]

I would appreciate any suggestions on how I could process the file as chunks (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, ...etc) and write the results a single file or multiple files.

In reply to How to process a BED file as chunk of lines based on chromosome names by rnaeye

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.