Dear Monks, I have written a script to calculate the frequency(=coverage) of start sites on DNA strands. The start position is located in column 2 when strand is (+), and located in column 3 when strand is (-). And coverage/frequency is stored in two different hashes: one for each strand.

#!/usr/bin/perl use 5.28.0; my $lineCounter = 1; my %startPoints; my %startPoints_plusStrand; my %startPoints_minusStrand; my ($chr, $start, $end, $strand); open (my $OUT_FH, ">", "results.bed" ) or die "Cannot open file.\n $! +"; # Count frequency of start sited in the column-2. while(<DATA>){ next if /^#/; #ignore comments in the input file chomp (my $line = $_); ($chr, $start, $end, $strand) = (split /\t/, $line)[0,1,2,5]; my $startPoint; if($strand eq '+'){ $startPoint = $start; $startPoints_plusStrand{$startPoint}++; } elsif ($strand eq '-') { $startPoint = $end; $startPoints_minusStrand{$startPoint}++; } else { die "Goofy BED Data on line $lineCounter:\n$_"; } } say $OUT_FH "#Plus Strand BedGraph:"; for my $key (sort {$a <=> $b} keys %startPoints_plusStrand) { say $OUT_FH $key+1, "\t","+","$startPoints_plusStrand{$key}"; } __DATA__ #chromosome start end position col-5 strand chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127472363 127473530 Pos2 0 + chr7 127473530 127474697 Pos3 0 - chr7 127474697 127475864 Pos4 0 - #chr8 40000 40500 Pos5 0 + #chr8 40000 40600 Pos5 0 + #chr8 40000 40650 Pos5 0 + #chr8 41000 41200 Pos6 0 -

If I were doing this for single chromosome, it would work fine (chromosome 7 in my DATA). However, I want to do this for a .BED file containing all 23 human chromosomes, chr1, chr2, chr3 ...... What I want to do is that to calculate coverages for each chromosome and write the result into a single output file or multiple output files like chr1-covrage.txt, chr2-covrage.txt.

My input file is larger than 1 GB. And, example data could be expanded to multiple chrosomes by uncommenting the last 4 lines in the __DATA__ section, which gives chr and chr8. Like:

chr7 127471196 127472363 Pos1 0 + chr8 40000 40500 Pos5 0 +

I would appreciate any suggestions on how I could process the file as chunks (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, ...etc) and write the results a single file or multiple files.


In reply to How to process a BED file as chunk of lines based on chromosome names by rnaeye

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.