monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Beloved Monks,

I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column of a string. For example the set of strings below:
AAA ATG TTT GTC Note that the length of these strings in the set maybe greater than 3.
Would give the following result:
$VAR1 = { 'A' => [2,1,1], 'T' => [1,3,1], 'C' => [0,0,1], 'G' => [1,0,1] }; So the size of the array is the same with the length of the string. In my case I need the variation of it, namely the probability of the each base occur in the particular position: $VAR = { 'A' => ['0.5','0.25','0.25'], 'T' => ['0.25','0.75','0.25'], 'C' => ['0','0','0.25'], 'G' => ['0.25','0','0.25'] }
Below is my incredibly naive, inefficient and ugly code. Can any body suggest a better and faster solution than this:
#!/usr/bin/perl -w use strict; use Data::Dumper; use Carp; my @Array = ('AAA', 'ATG', 'TTT', 'GTC'); my ($PWM) = compute_pwm(@Array); print Dumper $PWM; #The code that does the job sub compute_pwm { my @mi = @_; my $motif_count; my $L; #-------Beginning of PWM processing ---------------- foreach my $mi (@mi) { chomp($mi); $mi =~ s/\s//g; $L = $mi; my @words = split( /\W+/, $mi ); $motif_count += @words; } $motif_count = 0; my $w = length($L); my @A = (); my @T = (); my @C = (); my @G = (); for ( my $j = 0; $j < $w; $j++ ) { # Initialize the base counts. my $a = 0; my $c = 0; my $g = 0; my $t = 0; foreach my $mi (@mi) { chomp($mi); my $L = $mi; my $sb = substr( $L, $j, 1 ); while ( $sb =~ /a/ig ) { $a++ } while ( $sb =~ /t/ig ) { $t++ } while ( $sb =~ /c/ig ) { $c++ } while ( $sb =~ /g/ig ) { $g++ } } push( @A, $a ); push( @T, $t ); push( @C, $c ); push( @G, $g ); } my $sumA = sumArray(@A); my $sumT = sumArray(@T); my $sumC = sumArray(@C); my $sumG = sumArray(@G); my @m = (); my @Anm1 = (); my @Tnm1 = (); my @Cnm1 = (); my @Gnm1 = (); my @sPos = (); #summing up A,T,C,G for all position for ( my $b = 0; $b < $w; $b++ ) { my $sumPos = $A[$b] + $T[$b] + $C[$b] + $G[$b]; push( @sPos, $sumPos ); my $nm1A = $A[$b]/$sumPos; my $nm1T = $T[$b]/$sumPos; my $nm1C = $C[$b]/$sumPos; my $nm1G = $G[$b]/$sumPos; push( @Anm1, $nm1A ); push( @Tnm1, $nm1T ); push( @Cnm1, $nm1C ); push( @Gnm1, $nm1G ); } my @PWM = pwm( \@Anm1, \@Tnm1, \@Cnm1, \@Gnm1 ); return \@PWM; } #--------------- Subs of the subroutines ---------------------------- sub pwm { #PWM in forms of AoH my ($A,$T,$C,$G) = @_; #input are array references my (%Ah,%Th,%Ch,%Gh); $Ah{'A'} = [@$A]; $Th{'T'} = [@$T]; $Ch{'C'} = [@$C]; $Gh{'G'} = [@$G]; my @PWM; #AoH push @PWM, {%Ah,%Th,%Ch,%Gh}; return @PWM; } sub sumArray { my @arr = @_; my $sum = 0; my $count = $#arr + 1; for(my $i=0;$i<$count;$i++){ $sum += $arr[$i]; } return $sum; }

So here I am again, humbly seeking for your wisdom.

Regards,
Edward

Replies are listed 'Best First'.
Re: Position Weight Matrix of Set of Strings
by BrowserUk (Patriarch) on Feb 16, 2006 at 10:31 UTC

    I think this meets the spec

    #! perl -slw use strict; my %pwm; while( my $line = <DATA> ) { chomp $line; ++$pwm{ substr $line, $_, 1 }[ $_ ] for 0 .. length( $line ) -1; } my $n = $.; @$_ = map{ $_ ? $_ / $n : 0 } @$_ for values %pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; __DATA__ AAA ATG TTT GTC

    Produces

    C:\test>530623 A => 0.5 0.25 0.25 T => 0.25 0.75 0.25 C => 0 0 0.25 G => 0.25 0 0.25

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Dear BrowserUK,

      First of all I would like to apologize for coming back to you again.

      Your code worked fine as you presented.
      But when I attempted to put your code above as a function:
      use Data::Dumper; my @inst = ( 'AAA', 'ATG', 'TTT', 'GTC' ); get_pwm(@inst); sub get_pwm { my @data = @_; my %pwm; foreach my $line (@data) { ++$pwm{ substr $line, $_, 1 }[$_] for 0 .. length($line) - 1; } my $n = $.; @$_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26 print Dumper \%pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; return; }
      It gives:
      Use of uninitialized value in division (/) at pwm.pl line 26. Illegal division by zero at pwm.pl line 26. Attempt to free temp prematurely: SV 0x8f6a3f8, Perl interpreter: 0x8f +53008. Attempt to free unreferenced scalar: SV 0x8f6a3f8, Perl interpreter: 0 +x8f53008.
      How can I overcome the problem above.

      Regards,
      Edward

        The problem arises in this line  my $n = $.;. $. (also known as $INPUT_LINE_NUMBER. See perlvar), is the record or line number of the last line read from a file.

        In the original code that line came immediately after the while( my $line = <DATA> ) { loop, so $n was set to the number of lines in the file.

        In your code/function, you are not reading from a file, therefore $. is 0, so $n gets set to 0, hence the error in this line:

        $_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26

        As you are passing the data into the function as a list and assigning it to an array, the equivalent value is the number of elements in the array. Eg. scalar @data;. So, modifying your code as

        use Data::Dumper; my @inst = ( 'AAA', 'ATG', 'TTT', 'GTC' ); get_pwm(@inst); sub get_pwm { my @data = @_; my %pwm; foreach my $line (@data) { ++$pwm{ substr $line, $_, 1 }[$_] for 0 .. length($line) - 1; } my $n = @data; ### Modified @$_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26 print Dumper \%pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; return; }

        Produces this output (assuming -l):

        $VAR1 = { 'A' => [ '0.5', '0.25', '0.25' ], 'T' => [ '0.25', '0.75', '0.25' ], 'C' => [ 0, 0, '0.25' ], 'G' => [ '0.25', 0, '0.25' ] }; A => 0.5 0.25 0.25 T => 0.25 0.75 0.25 C => 0 0 0.25 G => 0.25 0 0.25

        Not sure why you are outputting the results two different ways. It's also not typical to output from within a subroutine.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Position Weight Matrix of Set of Strings
by TedPride (Priest) on Feb 16, 2006 at 13:09 UTC
    use strict; use warnings; use Data::Dumper; my (%letters, $l); while (<DATA>) { chomp; @_ = split //; $l++; $letters{$_[$_]}[$_]++ for (0..$#_); } for (keys %letters) { no warnings; $_ /= $l for @{$letters{$_}}; } print Dumper(\%letters); __DATA__ AAA ATG TTT GTC
Re: Position Weight Matrix of Set of Strings
by tweetiepooh (Hermit) on Feb 16, 2006 at 10:11 UTC
    A quick visit to CPAN and a search on DNA gave 6 pages of modules where people have coded stuff to work with DNA base sequences and similar.

    I've not looked deeply as although my degree is in Biochemistry I never really enjoyed genetics that much.

    Why not have a quick snoop over there and see if someone already has solved this problem?

    Not saying that there won't be those who sniff at the danger and love the challenge to develop new ways to do it better, that's partly what perl is all about.

    Update It seems some folk don't like this post. I guess that's what lifes about here in the monastery.

    It's just that I have noticed a fair few questions that appear to be about DNA bases and initially was going to suggest collecting some of the best responces and putting in a module. There are already plenty of DNA modules though hence the suggestion to check CPAN.

    The code solutions provided illustrate the latter point that TMTOWTDI is alive and well.