sally123 has asked for the wisdom of the Perl Monks concerning the following question:

I have to count the number of overlapping dimers (AA, AG, AC, AT, GA, GG, GC, GT, CC, CG, CA, CT, TT, TA, TG, TC) in multiple sequences using PERL.

I wrote the following code but it only works for 1 sequence. How can I extend it to multiple sequences?

#!/usr/bin/perl -w open FH, "sample.txt"; $genome=<FH>; %count=(); $count{substr($genome, $_, 2)}++ for 0..length($genome)-2; print "AA: $count{AA}\n"; print "AG: $count{AG}\n"; print "AC: $count{AC}\n"; print "AT: $count{AT}\n"; print "TA: $count{TA}\n"; print "TG: $count{TG}\n"; print "TC: $count{TC}\n"; print "TT: $count{TT}\n"; print "GA: $count{GA}\n"; print "GG: $count{GG}\n"; print "GC: $count{GC}\n"; print "GT: $count{GT}\n"; print "CA: $count{CA}\n"; print "CG: $count{CG}\n"; print "CC: $count{CC}\n"; print "CT: $count{CT}\n";

I need 1. counts for each sequence and 2. the total counts

Input Example: Sample.txt

ATGGGCTCCTCCGCCATCACCGTGAGCTTCCTCCTCTTTCTGGCATTTCAGCTCCCAGGGCAAACAGGAG +CAAATCCCGTGTATGGCTCTGTGTCCAATGCAGACCTGATGGATTTCAAGTAAAAG ATGGTGAGAAAA +TGGGCCCTGCTCCTGCCCATGCTGCTCTGCGGCCTGACTGGTCCCGCACACCTCTTCCAGCCAAGCCTG +GTGCTGGAGATGGCCCAGGTCCTCTTGGATAACTACTGCTTCCCAGAGAACCTGATGGGGATGCAGGGA +GCCATCGAGCAGGCCATCAAAAGTCAGGAGATTCTGTCTATCTCAGACCCTCAGACTCTGGCCCATGTG +TTGACAGCTGGGGTGCAGAGCTCCTTGAATGACCCTCGCCTGGTCATCTCCTATGAGCCCAGCACCCTC +GAGGCCCCTCCGCGAGCTCCAGCAGTCACGAACCTCACACTAGAGGAAATCATCGCAGGGCTGCAGGAT +GGCCTTCGCCATGAGATTCTGGAAGGCAATGTGGGCTACCTGCGGGTGGACGACATCCCGGGCCAGGAG +GTGATGAGCAAGCTGAGGAGCTTCCTGGTGGCCAACGTCTGGAGGAAGCTCGTGAACACGTCCGCCTTG +GTGCTGGACCTCCGGCACTGCACTGGGGGACACGTGTCTGGCATCCCCTATGTCATCTCCTACCTGCAC +CCAGGGAGCACAGTCTCGCACGTGGACACCGTCTACGACCGCCCCTCCAACACAACCACTGAGATCTGG +ACCCTGCCTGAAGCCCTGGGAGAGAAGTACAGTGCAGACAAGGATGTGGTGGTCCTCACCAGCAGCCGC +ACGGGGGGCGTGGCTGAGGACATCGCTTACATCCTCAAACAGATGCGCAGGGCCATCGTGGTGGGCGAG +CGGACTGTTGGGGGGGCTCTGAACCTCCAGAAGCTGAGGGTAGGCCAGTCCGACTTCTTTCTCACTGTG +CCTGTGTCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGTGGGAGGGCAGTGGGGTGCTGCCC +TGTGTGGGGACACCGGCCGAGCAGGCCCTGGAGAAAGCCCTGGCCGTTCTCATGCTGCGCAGGGCCCTG +CCAGGAGTCATTCAGCGCCTTCAGGAGGCGCTGCGCGAGTACTACACGCTGGTGGACCGTGTGCCCGCC +CTGCTGAGCCACCTGGCCGCCATGGACCTGTCCTCGGTGGTCTCCGAGGACGATCTGGTCACTAAGCTC +AATGCTGGCCTGCAGGCTGTGTCTGAGGACCCCAGGCTCCAGGTGCAGGTGGTCAGACCCAAAGAAGCC +TCTTCTGGGCCTGAGGAAGAAGCTGAAGAACCTCCAGAGGCGGTCCCGGAAGTGCCCGAGGACGAGGCT +GTTCGGCGGGCTCTGGTGGACTCCGTGTTCCAGGTTTCTGTGCTGCCGGGCAACGTGGGCTACCTGCGC +TTCGACAGTTTCGCTGATGCCTCTGTCCTGGAGGTGCTGGGCCCCTACATCCTGCACCAGGTGTGGGAG +CCCCTGCAGGACACGGAGCACCTCATCATGGACCTGCGGCAGAACCCCGGGGGGCCGTCCTCCGCGGTG +CCCCTGCTGCTCTCCTACTTCCAGAGCCCTGACGCCAGCCCCGTGCGCCTCTTCTCCACCTACGACCGG +CGCACCAACATCACACGCGAGCACTTCAGCCAGACGGAGCTGCTGGGCCGGCCCTACGGCACCCAGCGT +GGCGTGTACCTGCTCACTAGCCACCGCACCGCCACCGCGGCCGAGGAGCTGGCCTTCCTCATGCAGTCA +CTGGGCTGGGCCACGCTGGTGGGCGAGATCACCGCGGGCAGCCTGCTGCACACACACACAGTATCCCTG +CTGGAGACGCCCGAGGGCGGCCTGGCGCTCACGGTGCCTGTGCTCACCTTCATCGACAACCATGGCGAG +TGCTGGCTGGGGGGCGGTGTGGTCCCCGATGCCATTGTGCTGGCCGAGGAAGCCCTAGACAGAGCCCAG +GAGGTGCTGGAGTTCCACCGAAGCTTGGGGGAGTTGGTGGAAGGCACGGGGCGCCTGCTGGAGGCCCAC +TACGCTCGGCCAGAGGTCGTGGGGCAGATGGGTGCCCTGCTGCGAGCCAAGCTGGCCCAGGGGGCCTAC +CGCACCGCGGTGGACCTGGAGTCGCTGGCTTCCCAGCTTACGGCCGACCTGCAGGAGATGTCTGGGGAC +CACCGTCTGCTGGTGTTCCACAGCCCCGGCGAAATGGTGGCTGAGGAGGCGCCCCCACCGCCTCCCGTC +GTCCCCTCCCCGGAGGAGCTGTCCTATCTCATCGAGGCCCTGTTCAAGACTGAGGTGCTGCCCGGCCAG +CTGGGCTACCTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTGAAGGCCGTCGGGCCACAGCTGGTG +CAGCTGGTGTGGCAGAAGCTGGTGGACACGGCCGCGCTGGTGGTCGACCTGCGCTACAACCCCGGCAGC +TACTCCACAGCCGTGCCTCTACTCTGCTCCTACTTCTTCGAGGCAGAGCCCCGCCGGCACCTCTACTCT +GTCTTTGACAGGGCCACGTCAAGGGTCACAGAGGTATGGACCCTGCCCCACGTTACAGGCCAGCGCTAT +GGCTCCCACAAGGACCTCTACGTTCTGGTGAGCCACACCAGCGGTTCAGCAGCTGAGGCTTTTGCTCAC +ACCATGCAGGATCTGCAGCGAGCCACCATCATCGGGGAGCCCACGGCCGGAGGGGCACTCTCCGTGGGA +ATCTACCAGGTGGGCAGCAGCGCCTTATACGCCTCCATGCCCACGCAGATGGCCATGAGTGCCAGCACC +GGCGAGGCCTGGGATCTGGCTGGGGTGGAGCCGGACATCACTGTGCCCATGAGCGTGGCCCTCTCCACA +GCCCGGGACATAGTGACCCTGCGTGCCAAGGTGCCCACTGTGCTGCAGACAGCTGGGAAGCTCGTAGCG +GATAACTACGCCTCCCCTGAGCTGGGAGTCAAGATGGCAGCCGAACTGAGTGGTCTGCAGAGCCGCTAT +GCCAGGGTGACCTCAGAAGCCGCCCTCGCCGAGCTGCTGCAAGCCGACCTGCAGGTGCTGTCCGGGGAC +CCACACCTGAAGACAGCTCATATACCTGAGGATGCCAAAGACCGCATTCCTGGCATTGTACCCATGCAG +TAACAG ATGGACATGATGGACGGCTGCCAGTTCTCGCCCTCTGAGTACTTCTACGACGGCTCCTGCAT +CCCATCCCCCGACGGTGAGTTCGGGGACGAGTTTGAGCCGCGAGTGGCTGCTTTCGGGGCTCACAAGGC +AGACCTGCAAGGCTCAGACGAGGACGAGCACGTGCGAGCACCCACGGGCCACCACCAGGCCGGCCACTG +CCTCATGTGGGCCTGCAAAGCATGCAAAAGGAAGTCCACCACCATGGATCGGCGGAAGGCGGCCACCAT +GCGCGAGCGGAGACGCCTGAAGAAGGTCAACCAGGCTTTCGACACGCTCAAGCGGTGCACCACGACCAA +CCCTAACCAGAGGCTGCCCAAGGTGGAGATCCTCAGGAATGCCATCCGCTACATTGAGAGTCTGCAGGA +GCTGCTTAGGGAACAGGTGGAAAACTACTATAGCCTGCCGGGGCAGAGCTGCTCTGAGCCCACCAGCCC +CACCTCAAGTTGCTCTGATGGCATGTAAATG

Many thanks!

Replies are listed 'Best First'.
Re: Counting overlapping dimers for multiple sequences
by flexvault (Monsignor) on Mar 26, 2012 at 08:14 UTC

    sally123,

    Two things immediately jump out!

    • You need to loop on reading your data.
    • You need to initialize your variables before you loop.

    Untested, this should get you started.
    #!/usr/bin/perl -w use strict; open my $FH, "<", "./sample.txt"; my %count = (); while ( my $genome = <$FH> ) { .... }

    A good practice is to define the scope of your variables (my, local, our).

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

Re: Counting overlapping dimers for multiple sequences
by johngg (Canon) on Mar 26, 2012 at 12:51 UTC

    Assuming that you are storing your sequences in files with meaningful names you could do something like this. Given these two example sequence files

    knoppix@Microknoppix:~/perl/Monks$ head -99 spw961600_seq* ==> spw961600_seqA <== TCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGCAACGTCTGGAGGAAGCT ==> spw961600_seqB <== CTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTCCTGCCCATGCTGCTC knoppix@Microknoppix:~/perl/Monks$

    this script

    use strict; use warnings; use Data::Dumper; my @seqFiles = glob q{spw961600_seq*}; my %counts; foreach my $seqFile ( @seqFiles ) { my $seq = do { open my $seqFH, q{<}, $seqFile or die qq{open: < $seqFile: $!\n}; local $/; <$seqFH>; }; while ( $seq =~ m{(?=(..))}g ) { $counts{ totals }->{ $1 } ++; $counts{ $seqFile }->{ $1 } ++; } } print Data::Dumper->Dumpxs( [ \ %counts ], [ qw{ *counts } ] );

    builds this data structure

    %counts = ( 'spw961600_seqB' => { 'AC' => 2, 'AG' => 2, 'CC' => 4, 'TG' => 7, 'AT' => 2, 'TC' => 3, 'GA' => 4, 'TT' => 2, 'CT' => 6, 'GG' => 3, 'CG' => 4, 'CA' => 2, 'GC' => 7, 'GT' => 2 }, 'spw961600_seqA' => { 'AC' => 2, 'AG' => 6, 'CC' => 7, 'TG' => 4, 'AT' => 1, 'TC' => 3, 'AA' => 2, 'GA' => 5, 'CT' => 4, 'CG' => 2, 'GG' => 9, 'GC' => 5, 'CA' => 4, 'GT' => 2 }, 'totals' => { 'AC' => 4, 'AG' => 8, 'CC' => 11, 'TG' => 11, 'AT' => 3, 'TC' => 6, 'AA' => 2, 'GA' => 9, 'TT' => 2, 'CT' => 10, 'CG' => 6, 'GG' => 12, 'GC' => 12, 'CA' => 6, 'GT' => 4 } );

    I hope I have understood your question correctly and that this will help you move forward.

    Cheers,

    JohnGG

Re: Counting overlapping dimers for multiple sequences
by ashish.kvarma (Monk) on Mar 26, 2012 at 08:20 UTC
    By assigning a file handle to a scalar you are only reading one line from it. To solve the problem you just need to put your code in a loop.

    Code snippet below should give you some idea

    while (my $genome= <FH>) { for (0..length($genome)-2) { my $dimer = substr($genome, $_, 2); $dimers[$idx]{$dimer}++ ; #seq wise length of dimer } $idx++; }
    Update: Fixed typo. Anyhow post seems similar to of flexvault :)
    Regards,
    Ashish