Re^2: bioinformatics problem

Replies are listed 'Best First'.
Re^3: bioinformatics problem - local scoring solution by bobf (Monsignor) on Jun 03, 2006 at 06:04 UTC
Yes, the subsequences are created by taking overlapping windows of sequences starting at position 1, 2, etc. I whipped up some code that replicates the scoring function described on the site. It gave me exactly the same results as the website version, but performing the calculation locally gave me all of the advantages I listed in my previous post plus: the local scoring is faster than using the website this code does not suffer from the 5000 residue limit imposed by the site The only caveat is that the rankings for subsequences with equivalent scores might be shuffled (the site gave no rules as to how sequences with the same score are ordered). The code below uses the default settings and the sequence from the OP. I know it could be cleaned up a bit, but the steps should be clear. # HLA Peptide Binding Predictions # http://thr.cit.nih.gov/molbio/hla_bind/ use warnings; use strict; use Data::Dumper; # load the scoring matrix # 9-mer Coefficient Table for HLA_A_0201 (A_0201_standard) my $nmer_size = 9; my $final_const = 0.069; my %mx; while( my $line = <DATA> ) { chomp $line; my ( $aa, @scores ) = split( /\t/, $line ); $mx{$aa} = [ @scores ]; } # score the seq from the OP my $seq = 'EALLKQSWEVLKQNIPGHSLCLFALIIEAAPESKYVFSFLKDSNEIPENNPKLK' . 'AHAAVIFKTICESATELRQKGQAVWDNNTLKRLGSIHLKNKITDPHFEVMKGAL' . 'LGTIKEAVKENWSDEMCCAWTEAYNQLVATIKAEMKE'; my %scores; foreach my $start_pos ( 0 .. length( $seq ) - $nmer_size ) { my $subseq = substr( $seq, $start_pos, $nmer_size ); $scores{$start_pos} = score_seq( $subseq ); } # print a table of results in the same format as the website print join( "\t", 'Rank', 'Start Pos', 'Subsequence', 'Score' ), "\n"; my $rank = 1; foreach my $start_pos ( sort { $scores{$b} <=> $scores{$a} } keys %sco +res ) { print join( "\t", $rank, $start_pos + 1, substr( $seq, $start_pos, $nmer_size ), $scores{$start_pos} ), "\n"; $rank++; } sub score_seq { my ( $seq ) = @_; # score the sequence according to the process described at # http://thr.cit.nih.gov/molbio/hla_bind/hla_motif_search_info.htm +l my $score = $final_const; foreach my $pos ( 0 .. length( $seq ) - 1 ) { # error checking should be added (invalid characters, etc) $score *= $mx{ substr( $seq, $pos, 1 ) }[$pos]; } return( sprintf( "%.3f", $score ) ); } __DATA__ A 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 C 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 D 0.075 0.100 0.400 4.100 1.000 1.000 0.490 1.000 0. +003 E 0.075 1.400 0.064 4.100 1.000 1.000 0.490 1.000 0. +003 F 4.600 0.050 3.700 1.000 3.800 1.900 5.800 5.500 0. +015 G 1.000 0.470 1.000 1.000 1.000 1.000 0.130 1.000 0. +015 H 0.034 0.050 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 I 1.700 9.900 1.000 1.000 1.000 2.300 1.000 0.410 2. +100 K 3.500 0.100 0.035 1.000 1.000 1.000 1.000 1.000 0. +003 L 1.700 72.000 3.700 1.000 1.000 2.300 1.000 1.000 4. +300 M 1.700 52.000 3.700 1.000 1.000 2.300 1.000 1.000 1. +000 N 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 P 0.022 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 Q 1.000 7.300 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 R 1.000 0.010 0.076 1.000 1.000 1.000 0.200 1.000 0. +003 S 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 T 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +500 V 1.700 6.300 1.000 1.000 1.000 2.300 1.000 0.410 14 +.000 W 4.600 0.010 8.300 1.000 1.000 1.700 7.500 5.500 0. +015 Y 4.600 0.010 3.200 1.000 1.000 1.500 1.000 5.500 0. +015 [download] Here is the output - the scores are exactly the same as the ones I got when I used the website. Read more... (5 kB) It was very easy to implement this locally, as I expected, and I saved myself from having to write a whole bunch of code for the web submission and parsing. It is trivial to change this example to use a different scoring matrix - just copy and paste from the site (into the __DATA__ section). If I were doing this project, I'd probably put all of the scoring data for each matrix into a config-like file, then write a module to read it and score sequences, using the same parameters as the site. I hope this helps.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^3: bioinformatics problem - local scoring solution
by bobf (Monsignor) on Jun 03, 2006 at 06:04 UTC

Yes, the subsequences are created by taking overlapping windows of sequences starting at position 1, 2, etc.

I whipped up some code that replicates the scoring function described on the site. It gave me exactly the same results as the website version, but performing the calculation locally gave me all of the advantages I listed in my previous post plus:

the local scoring is faster than using the website
this code does not suffer from the 5000 residue limit imposed by the site

The only caveat is that the rankings for subsequences with equivalent scores might be shuffled (the site gave no rules as to how sequences with the same score are ordered).

The code below uses the default settings and the sequence from the OP. I know it could be cleaned up a bit, but the steps should be clear.

# HLA Peptide Binding Predictions
# http://thr.cit.nih.gov/molbio/hla_bind/

use warnings;
use strict;

use Data::Dumper;

# load the scoring matrix
# 9-mer Coefficient Table for HLA_A_0201 (A_0201_standard)

my $nmer_size = 9;
my $final_const = 0.069;
my %mx;
while( my $line = <DATA> )
{
    chomp $line;
    my ( $aa, @scores ) = split( /\t/, $line );
    $mx{$aa} = [ @scores ];
}

# score the seq from the OP

my $seq = 'EALLKQSWEVLKQNIPGHSLCLFALIIEAAPESKYVFSFLKDSNEIPENNPKLK' .
          'AHAAVIFKTICESATELRQKGQAVWDNNTLKRLGSIHLKNKITDPHFEVMKGAL' .
          'LGTIKEAVKENWSDEMCCAWTEAYNQLVATIKAEMKE';

my %scores;
foreach my $start_pos ( 0 .. length( $seq ) - $nmer_size )
{
    my $subseq = substr( $seq, $start_pos, $nmer_size );
    $scores{$start_pos} = score_seq( $subseq );
}

# print a table of results in the same format as the website

print join( "\t", 'Rank', 'Start Pos', 'Subsequence', 'Score' ), "\n";

my $rank = 1;
foreach my $start_pos ( sort { $scores{$b} <=> $scores{$a} } keys %sco
+res )
{
    print join( "\t", $rank, $start_pos + 1,
                      substr( $seq, $start_pos, $nmer_size ),
                      $scores{$start_pos} ), "\n";
    $rank++;
}

sub score_seq
{
    my ( $seq ) = @_;

    # score the sequence according to the process described at
    # http://thr.cit.nih.gov/molbio/hla_bind/hla_motif_search_info.htm
+l

    my $score = $final_const;
    foreach my $pos ( 0 .. length( $seq ) - 1 )
    {
        # error checking should be added (invalid characters, etc)
        $score *= $mx{ substr( $seq, $pos, 1 ) }[$pos];
    }

    return( sprintf( "%.3f", $score ) );
}

__DATA__
A   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.
+000
C   1.000   0.470   1.000   1.000   1.000   1.000   1.000   1.000   1.
+000
D   0.075   0.100   0.400   4.100   1.000   1.000   0.490   1.000   0.
+003
E   0.075   1.400   0.064   4.100   1.000   1.000   0.490   1.000   0.
+003
F   4.600   0.050   3.700   1.000   3.800   1.900   5.800   5.500   0.
+015
G   1.000   0.470   1.000   1.000   1.000   1.000   0.130   1.000   0.
+015
H   0.034   0.050   1.000   1.000   1.000   1.000   1.000   1.000   0.
+015
I   1.700   9.900   1.000   1.000   1.000   2.300   1.000   0.410   2.
+100
K   3.500   0.100   0.035   1.000   1.000   1.000   1.000   1.000   0.
+003
L   1.700   72.000  3.700   1.000   1.000   2.300   1.000   1.000   4.
+300
M   1.700   52.000  3.700   1.000   1.000   2.300   1.000   1.000   1.
+000
N   1.000   0.470   1.000   1.000   1.000   1.000   1.000   1.000   0.
+015
P   0.022   0.470   1.000   1.000   1.000   1.000   1.000   1.000   0.
+003
Q   1.000   7.300   1.000   1.000   1.000   1.000   1.000   1.000   0.
+003
R   1.000   0.010   0.076   1.000   1.000   1.000   0.200   1.000   0.
+003
S   1.000   0.470   1.000   1.000   1.000   1.000   1.000   1.000   0.
+015
T   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.
+500
V   1.700   6.300   1.000   1.000   1.000   2.300   1.000   0.410   14
+.000
W   4.600   0.010   8.300   1.000   1.000   1.700   7.500   5.500   0.
+015
Y   4.600   0.010   3.200   1.000   1.000   1.500   1.000   5.500   0.
+015
[download]

Here is the output - the scores are exactly the same as the ones I got when I used the website.