in reply to Re: bioinformatics problem
in thread bioinformatics problem

hi

I looked through the site with the information.. but with regard tothe scoring function, it does not say how the subset of amino acids are to be taken... like is it from 1-9, 2-10 and so on.. and it says the scores are just multiplied.. i tried that using the given scoring matrix but i didnt get the score that was reported... how did u get the scoring method?

arvnd

Replies are listed 'Best First'.
Re^3: bioinformatics problem - local scoring solution
by bobf (Monsignor) on Jun 03, 2006 at 06:04 UTC

    Yes, the subsequences are created by taking overlapping windows of sequences starting at position 1, 2, etc.

    I whipped up some code that replicates the scoring function described on the site. It gave me exactly the same results as the website version, but performing the calculation locally gave me all of the advantages I listed in my previous post plus:

    • the local scoring is faster than using the website
    • this code does not suffer from the 5000 residue limit imposed by the site

    The only caveat is that the rankings for subsequences with equivalent scores might be shuffled (the site gave no rules as to how sequences with the same score are ordered).

    The code below uses the default settings and the sequence from the OP. I know it could be cleaned up a bit, but the steps should be clear.

    # HLA Peptide Binding Predictions # http://thr.cit.nih.gov/molbio/hla_bind/ use warnings; use strict; use Data::Dumper; # load the scoring matrix # 9-mer Coefficient Table for HLA_A_0201 (A_0201_standard) my $nmer_size = 9; my $final_const = 0.069; my %mx; while( my $line = <DATA> ) { chomp $line; my ( $aa, @scores ) = split( /\t/, $line ); $mx{$aa} = [ @scores ]; } # score the seq from the OP my $seq = 'EALLKQSWEVLKQNIPGHSLCLFALIIEAAPESKYVFSFLKDSNEIPENNPKLK' . 'AHAAVIFKTICESATELRQKGQAVWDNNTLKRLGSIHLKNKITDPHFEVMKGAL' . 'LGTIKEAVKENWSDEMCCAWTEAYNQLVATIKAEMKE'; my %scores; foreach my $start_pos ( 0 .. length( $seq ) - $nmer_size ) { my $subseq = substr( $seq, $start_pos, $nmer_size ); $scores{$start_pos} = score_seq( $subseq ); } # print a table of results in the same format as the website print join( "\t", 'Rank', 'Start Pos', 'Subsequence', 'Score' ), "\n"; my $rank = 1; foreach my $start_pos ( sort { $scores{$b} <=> $scores{$a} } keys %sco +res ) { print join( "\t", $rank, $start_pos + 1, substr( $seq, $start_pos, $nmer_size ), $scores{$start_pos} ), "\n"; $rank++; } sub score_seq { my ( $seq ) = @_; # score the sequence according to the process described at # http://thr.cit.nih.gov/molbio/hla_bind/hla_motif_search_info.htm +l my $score = $final_const; foreach my $pos ( 0 .. length( $seq ) - 1 ) { # error checking should be added (invalid characters, etc) $score *= $mx{ substr( $seq, $pos, 1 ) }[$pos]; } return( sprintf( "%.3f", $score ) ); } __DATA__ A 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 C 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 D 0.075 0.100 0.400 4.100 1.000 1.000 0.490 1.000 0. +003 E 0.075 1.400 0.064 4.100 1.000 1.000 0.490 1.000 0. +003 F 4.600 0.050 3.700 1.000 3.800 1.900 5.800 5.500 0. +015 G 1.000 0.470 1.000 1.000 1.000 1.000 0.130 1.000 0. +015 H 0.034 0.050 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 I 1.700 9.900 1.000 1.000 1.000 2.300 1.000 0.410 2. +100 K 3.500 0.100 0.035 1.000 1.000 1.000 1.000 1.000 0. +003 L 1.700 72.000 3.700 1.000 1.000 2.300 1.000 1.000 4. +300 M 1.700 52.000 3.700 1.000 1.000 2.300 1.000 1.000 1. +000 N 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 P 0.022 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 Q 1.000 7.300 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 R 1.000 0.010 0.076 1.000 1.000 1.000 0.200 1.000 0. +003 S 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 T 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +500 V 1.700 6.300 1.000 1.000 1.000 2.300 1.000 0.410 14 +.000 W 4.600 0.010 8.300 1.000 1.000 1.700 7.500 5.500 0. +015 Y 4.600 0.010 3.200 1.000 1.000 1.500 1.000 5.500 0. +015

    Here is the output - the scores are exactly the same as the ones I got when I used the website.

    It was very easy to implement this locally, as I expected, and I saved myself from having to write a whole bunch of code for the web submission and parsing.

    It is trivial to change this example to use a different scoring matrix - just copy and paste from the site (into the __DATA__ section). If I were doing this project, I'd probably put all of the scoring data for each matrix into a config-like file, then write a module to read it and score sequences, using the same parameters as the site.

    I hope this helps.