Position Weight Matrix of Set of Strings

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Beloved Monks,

I am looking for a really efficient way to compute a position weight matrix (PWM) from a set of strings. In each set the strings are of the same length. Basically PWM compute the frequency (or probabilities) of bases [ATCG] occur in each position/column of a string. For example the set of strings below:

                    AAA
                    ATG
                    TTT
                    GTC

Note that the length of these strings in the set 
maybe greater than 3.
[download]

Would give the following result:

$VAR1 =  {
            'A' => [2,1,1],
            'T' => [1,3,1],
            'C' => [0,0,1],
            'G' => [1,0,1]
         };

So the size of the array is the same with the length of the string.

In my case I need the variation of it, namely the probability of the 
each base occur in the particular position:
$VAR =     {
            'A' => ['0.5','0.25','0.25'],
            'T' => ['0.25','0.75','0.25'],
            'C' => ['0','0','0.25'],
            'G' => ['0.25','0','0.25']
          }
[download]

Below is my incredibly naive, inefficient and ugly code. Can any body suggest a better and faster solution than this:

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;

my @Array = ('AAA',
             'ATG',
             'TTT',
             'GTC');

my ($PWM) = compute_pwm(@Array);


print Dumper $PWM;


#The code that does the job

sub compute_pwm {
    my @mi = @_;
    my $motif_count;
    my $L;

    #-------Beginning of PWM processing ----------------
    foreach my $mi (@mi) {
        chomp($mi);
        $mi =~ s/\s//g;
        $L = $mi;

        my @words = split( /\W+/, $mi );
        $motif_count += @words;
    }
    $motif_count = 0;

    my $w = length($L);
    my @A = ();
    my @T = ();
    my @C = ();
    my @G = ();

    for ( my $j = 0; $j < $w; $j++ ) {

        # Initialize the base counts.
        my $a = 0;
        my $c = 0;
        my $g = 0;
        my $t = 0;

        foreach my $mi (@mi) {
            chomp($mi);
            my $L = $mi;
            my $sb = substr( $L, $j, 1 );
            while ( $sb =~ /a/ig ) { $a++ }
            while ( $sb =~ /t/ig ) { $t++ }
            while ( $sb =~ /c/ig ) { $c++ }
            while ( $sb =~ /g/ig ) { $g++ }

        }
        push( @A, $a );
        push( @T, $t );
        push( @C, $c );
        push( @G, $g );
    }

    my $sumA = sumArray(@A);
    my $sumT = sumArray(@T);
    my $sumC = sumArray(@C);
    my $sumG = sumArray(@G);

    my @m    = ();
    my @Anm1 = ();
    my @Tnm1 = ();
    my @Cnm1 = ();
    my @Gnm1 = ();
    my @sPos = ();

    #summing up A,T,C,G for all position
    for ( my $b = 0; $b < $w; $b++ ) {
        my $sumPos = $A[$b] + $T[$b] + $C[$b] + $G[$b];
        push( @sPos, $sumPos );

        my $nm1A = $A[$b]/$sumPos;
        my $nm1T = $T[$b]/$sumPos;
        my $nm1C = $C[$b]/$sumPos;
        my $nm1G = $G[$b]/$sumPos;

        push( @Anm1, $nm1A );
        push( @Tnm1, $nm1T );
        push( @Cnm1, $nm1C );
        push( @Gnm1, $nm1G );
    }

    my @PWM = pwm( \@Anm1, \@Tnm1, \@Cnm1, \@Gnm1 );
    return \@PWM;

}




#--------------- Subs of the subroutines ----------------------------

sub pwm {
    #PWM in forms of AoH
    
     my ($A,$T,$C,$G) = @_; #input are array references
     my (%Ah,%Th,%Ch,%Gh);

     $Ah{'A'} = [@$A];
     $Th{'T'} = [@$T];
     $Ch{'C'} = [@$C];
     $Gh{'G'} = [@$G];

     my @PWM; #AoH
     push @PWM, {%Ah,%Th,%Ch,%Gh};

     return @PWM;
}

sub sumArray
{
  my @arr = @_;
  my $sum = 0;
  my $count = $#arr + 1;
  for(my $i=0;$i<$count;$i++){
     $sum += $arr[$i];
  }
  return $sum;
}
[download]

So here I am again, humbly seeking for your wisdom.

Regards,
Edward

Comment on Position Weight Matrix of Set of Strings Select or Download Code

Replies are listed 'Best First'.
Re: Position Weight Matrix of Set of Strings by BrowserUk (Patriarch) on Feb 16, 2006 at 10:31 UTC
I think this meets the spec `#! perl -slw use strict; my %pwm; while( my $line = <DATA> ) { chomp $line; ++$pwm{ substr $line, $_, 1 }[ $_ ] for 0 .. length( $line ) -1; } my $n = $.; @$_ = map{ $_ ? $_ / $n : 0 } @$_ for values %pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; __DATA__ AAA ATG TTT GTC` [download] Produces `C:\test>530623 A => 0.5 0.25 0.25 T => 0.25 0.75 0.25 C => 0 0 0.25 G => 0.25 0 0.25` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Position Weight Matrix of Set of Strings by monkfan (Curate) on Sep 06, 2006 at 08:49 UTC
Dear BrowserUK, First of all I would like to apologize for coming back to you again. Your code worked fine as you presented. But when I attempted to put your code above as a function: `use Data::Dumper; my @inst = ( 'AAA', 'ATG', 'TTT', 'GTC' ); get_pwm(@inst); sub get_pwm { my @data = @_; my %pwm; foreach my $line (@data) { ++$pwm{ substr $line, $_, 1 }[$_] for 0 .. length($line) - 1; } my $n = $.; @$_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26 print Dumper \%pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; return; }` [download] It gives: `Use of uninitialized value in division (/) at pwm.pl line 26. Illegal division by zero at pwm.pl line 26. Attempt to free temp prematurely: SV 0x8f6a3f8, Perl interpreter: 0x8f +53008. Attempt to free unreferenced scalar: SV 0x8f6a3f8, Perl interpreter: 0 +x8f53008.` [download] How can I overcome the problem above. Regards, Edward	[reply] [d/l] [select]
Re^3: Position Weight Matrix of Set of Strings by BrowserUk (Patriarch) on Sep 06, 2006 at 09:41 UTC
The problem arises in this line `my $n = $.;`. `$.` (also known as `$INPUT_LINE_NUMBER`. See perlvar), is the record or line number of the last line read from a file. In the original code that line came immediately after the `while( my $line = <DATA> ) {` loop, so `$n` was set to the number of lines in the file. In your code/function, you are not reading from a file, therefore `$.` is 0, so `$n` gets set to 0, hence the error in this line: `$_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26` [download] As you are passing the data into the function as a list and assigning it to an array, the equivalent value is the number of elements in the array. Eg. `scalar @data;`. So, modifying your code as `use Data::Dumper; my @inst = ( 'AAA', 'ATG', 'TTT', 'GTC' ); get_pwm(@inst); sub get_pwm { my @data = @_; my %pwm; foreach my $line (@data) { ++$pwm{ substr $line, $_, 1 }[$_] for 0 .. length($line) - 1; } my $n = @data; ### Modified @$_ = map { $_ ? $_ / $n : 0 } @$_ for values %pwm; # Line 26 print Dumper \%pwm; print "$_ => @{ $pwm{ $_ } }" for keys %pwm; return; }` [download] Produces this output (assuming -l): `$VAR1 = { 'A' => [ '0.5', '0.25', '0.25' ], 'T' => [ '0.25', '0.75', '0.25' ], 'C' => [ 0, 0, '0.25' ], 'G' => [ '0.25', 0, '0.25' ] }; A => 0.5 0.25 0.25 T => 0.25 0.75 0.25 C => 0 0 0.25 G => 0.25 0 0.25` [download] Not sure why you are outputting the results two different ways. It's also not typical to output from within a subroutine. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^4: Position Weight Matrix of Set of Strings by neversaint (Deacon) on Sep 07, 2006 at 02:31 UTC
Re^5: Position Weight Matrix of Set of Strings by BrowserUk (Patriarch) on Sep 07, 2006 at 03:55 UTC
Some notes below your chosen depth have not been shown here
Re: Position Weight Matrix of Set of Strings by TedPride (Priest) on Feb 16, 2006 at 13:09 UTC
`use strict; use warnings; use Data::Dumper; my (%letters, $l); while (<DATA>) { chomp; @_ = split //; $l++; $letters{$_[$_]}[$_]++ for (0..$#_); } for (keys %letters) { no warnings; $_ /= $l for @{$letters{$_}}; } print Dumper(\%letters); __DATA__ AAA ATG TTT GTC` [download]	[reply] [d/l]
Re: Position Weight Matrix of Set of Strings by tweetiepooh (Hermit) on Feb 16, 2006 at 10:11 UTC
A quick visit to CPAN and a search on DNA gave 6 pages of modules where people have coded stuff to work with DNA base sequences and similar. I've not looked deeply as although my degree is in Biochemistry I never really enjoyed genetics that much. Why not have a quick snoop over there and see if someone already has solved this problem? Not saying that there won't be those who sniff at the danger and love the challenge to develop new ways to do it better, that's partly what perl is all about. Update It seems some folk don't like this post. I guess that's what lifes about here in the monastery. It's just that I have noticed a fair few questions that appear to be about DNA bases and initially was going to suggest collecting some of the best responces and putting in a module. There are already plenty of DNA modules though hence the suggestion to check CPAN. The code solutions provided illustrate the latter point that TMTOWTDI is alive and well.	[reply]