Dear Monks, I have written a script to calculate the frequency(=coverage) of start sites on DNA strands. The start position is located in column 2 when strand is (+), and located in column 3 when strand is (-). And coverage/frequency is stored in two different hashes: one for each strand.
#!/usr/bin/perl use 5.28.0; my $lineCounter = 1; my %startPoints; my %startPoints_plusStrand; my %startPoints_minusStrand; my ($chr, $start, $end, $strand); open (my $OUT_FH, ">", "results.bed" ) or die "Cannot open file.\n $! +"; # Count frequency of start sited in the column-2. while(<DATA>){ next if /^#/; #ignore comments in the input file chomp (my $line = $_); ($chr, $start, $end, $strand) = (split /\t/, $line)[0,1,2,5]; my $startPoint; if($strand eq '+'){ $startPoint = $start; $startPoints_plusStrand{$startPoint}++; } elsif ($strand eq '-') { $startPoint = $end; $startPoints_minusStrand{$startPoint}++; } else { die "Goofy BED Data on line $lineCounter:\n$_"; } } say $OUT_FH "#Plus Strand BedGraph:"; for my $key (sort {$a <=> $b} keys %startPoints_plusStrand) { say $OUT_FH $key+1, "\t","+","$startPoints_plusStrand{$key}"; } __DATA__ #chromosome start end position col-5 strand chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127472363 127473530 Pos2 0 + chr7 127473530 127474697 Pos3 0 - chr7 127474697 127475864 Pos4 0 - #chr8 40000 40500 Pos5 0 + #chr8 40000 40600 Pos5 0 + #chr8 40000 40650 Pos5 0 + #chr8 41000 41200 Pos6 0 -
If I were doing this for single chromosome, it would work fine (chromosome 7 in my DATA). However, I want to do this for a .BED file containing all 23 human chromosomes, chr1, chr2, chr3 ...... What I want to do is that to calculate coverages for each chromosome and write the result into a single output file or multiple output files like chr1-covrage.txt, chr2-covrage.txt.
My input file is larger than 1 GB. And, example data could be expanded to multiple chrosomes by uncommenting the last 4 lines in the __DATA__ section, which gives chr and chr8. Like:
chr7 127471196 127472363 Pos1 0 + chr8 40000 40500 Pos5 0 +
I would appreciate any suggestions on how I could process the file as chunks (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, ...etc) and write the results a single file or multiple files.
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |