bioinformatics problem

mutatedgene has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: bioinformatics problem by wfsp (Abbot) on May 27, 2006 at 11:00 UTC
To get you started. ...feeding the data to the online tool... Have a look at the LWP and WWW-Mechanize ...getting the results... Have a look at HTML::TokeParser Update: Added WWW-Mechanize Update 2: This gets the html: #!/usr/bin/perl use strict; use warnings; use WWW::Mechanize; # field name: inseq my $url = q\|http://thr.cit.nih.gov/molbio/hla_bind/index.shtml\|; my $seq = 'EALLKQSWEVLKQNIPGHSLCLFALIIEAAPESKYVFSFLKDSNEIPENNPKLKAHAAV +IFKTICESATE LRQKGQAVWDNNTLKRLGSIHLKNKITDPHFEVMKGALLGTIKEAVKENWSDEMCCAWTEAYNQLVATIK AEMKE'; my $mech = WWW::Mechanize->new() or die "couldn't get Mech object: $!"; $mech->get($url) or die "couldn't 'get': $!"; $mech->submit_form( form_number => 1, fields => { inseq => $seq, } ) or die "couldn't submit form"; my $html = $mech->content() or die "content failed: $!"; { my $file = 'bio_output.html'; open my $fh, '>', $file or die "can't open $file: $!"; print $fh $html; close $fh; } [download] The error checking may be a <cough> tad excessive :-) Update 3 Here's my go at extracting the data: `#!/usr/bin/perl use strict; use warnings; use HTML::TableExtract; my $html; { local $/; my $file = 'bio_output.html'; open my $fh, '<', $file or die "can't open $file: $!"; $html = <$fh>; close $fh; } my $t = HTML::TableExtract->new(); $t->parse($html); my $report = $t->tables_report(1); print $report;` [download] output: ---------- Capture Output ---------- > "C:\Perl\bin\perl.exe" monk18.pl TABLE(0, 0): User Parameters and Scoring Information: method selected to limit number of results:explicit number number of results requested:20 HLA molecule type selected:A_0201 length selected for subsequences to be scored:9 echoing mode selected for input sequence:Y echoing format:numbered lines length of user's input peptide sequence:145 number of subsequence scores calculated:137 number of top-scoring subsequences reported back in scoring output tab +le:20 TABLE(0, 1): Scoring Results::: Rank:Start Position:Subsequence Residue Listing:Score (Estimate of Hal +f Time of Disassociation of a Molecule Containing This Subsequence) 1: 2:ALLKQSWEV:1930.068 2: 95:KITDPHFEV: 795.962 3: 108:LLGTIKEAV: 57.937 4: 107:ALLGTIKEA: 42.278 5: 21:CLFALIIEA: 42.278 6: 19:SLCLFALII: 16.254 7: 63:TICESATEL: 12.043 8: 12:KQNIPGHSL: 7.581 9: 14:NIPGHSLCL: 2.937 10: 101:FEVMKGALL: 1.911 11: 133:NQLVATIKA: 1.864 12: 35:YVFSFLKDS: 0.970 13: 45:EIPENNPKL: 0.903 14: 75:GQAVWDNNT: 0.756 15: 135:LVATIKAEM: 0.739 16: 103:VMKGALLGT: 0.737 17: 52:KLKAHAAVI: 0.524 18: 123:EMCCAWTEA: 0.457 19: 3:LLKQSWEVL: 0.434 20: 89:SIHLKNKIT: 0.420 [download]	[reply] [d/l] [select]
Re^2: bioinformatics problem by mutatedgene (Novice) on May 31, 2006 at 08:08 UTC
hi thanks a lot for the help.. but i dont seem to be able to install teh lwp and the mehanize modules.. do i need admin permission to install? and im going through proxy server.. pl help!! arvind	[reply]
Re: bioinformatics problem by bobf (Monsignor) on May 27, 2006 at 18:58 UTC
Instead of hitting the site more than 1000 times (or more, will this analysis be repeated?), have you considered doing the scoring on your end? The link you provided contains a link that describes how the scores are calculated (http://thr.cit.nih.gov/molbio/hla_bind/hla_motif_search_info.html), and all of the coefficient tables are available (http://thr.cit.nih.gov/cgi-bin/molbio/hla_coefficient_viewing_page). From what I can tell, the developers of that site made it very easy for anyone else to replicate the scoring function (which appears very simple) locally. In addition to the warm, fuzzy feeling you'll get from not swamping their site, doing this locally has several other advantages, including: you're not dependent on their server (uptime, load) you don't have to write code to submit the data to their form (which is subject to format changes) you don't have to parse the HTML tables in the results page (which are subject to format changes) I did a similar project a while ago. The time I saved by not having to write all the code to submit the input data via the form and parse the output (HTML pages) more than made up for the little time it took to create a module that contains the data and scoring function. In addition, I know that I'll never have to debug and recode anything because someone decided to change the format of one of the output tables.	[reply]
Re^2: bioinformatics problem by mutatedgene (Novice) on May 29, 2006 at 08:12 UTC
hey .. thanks a lot for enlightening me on that one.. never saw that...am going thro the site right now.. and your idea seems a lot simpler.. will try to do that one.. thanks for the help once again! arvind	[reply]
Re^2: bioinformatics problem by mutatedgene (Novice) on May 31, 2006 at 06:19 UTC
hi I looked through the site with the information.. but with regard tothe scoring function, it does not say how the subset of amino acids are to be taken... like is it from 1-9, 2-10 and so on.. and it says the scores are just multiplied.. i tried that using the given scoring matrix but i didnt get the score that was reported... how did u get the scoring method? arvnd	[reply]
Re^3: bioinformatics problem - local scoring solution by bobf (Monsignor) on Jun 03, 2006 at 06:04 UTC
Yes, the subsequences are created by taking overlapping windows of sequences starting at position 1, 2, etc. I whipped up some code that replicates the scoring function described on the site. It gave me exactly the same results as the website version, but performing the calculation locally gave me all of the advantages I listed in my previous post plus: the local scoring is faster than using the website this code does not suffer from the 5000 residue limit imposed by the site The only caveat is that the rankings for subsequences with equivalent scores might be shuffled (the site gave no rules as to how sequences with the same score are ordered). The code below uses the default settings and the sequence from the OP. I know it could be cleaned up a bit, but the steps should be clear. # HLA Peptide Binding Predictions # http://thr.cit.nih.gov/molbio/hla_bind/ use warnings; use strict; use Data::Dumper; # load the scoring matrix # 9-mer Coefficient Table for HLA_A_0201 (A_0201_standard) my $nmer_size = 9; my $final_const = 0.069; my %mx; while( my $line = <DATA> ) { chomp $line; my ( $aa, @scores ) = split( /\t/, $line ); $mx{$aa} = [ @scores ]; } # score the seq from the OP my $seq = 'EALLKQSWEVLKQNIPGHSLCLFALIIEAAPESKYVFSFLKDSNEIPENNPKLK' . 'AHAAVIFKTICESATELRQKGQAVWDNNTLKRLGSIHLKNKITDPHFEVMKGAL' . 'LGTIKEAVKENWSDEMCCAWTEAYNQLVATIKAEMKE'; my %scores; foreach my $start_pos ( 0 .. length( $seq ) - $nmer_size ) { my $subseq = substr( $seq, $start_pos, $nmer_size ); $scores{$start_pos} = score_seq( $subseq ); } # print a table of results in the same format as the website print join( "\t", 'Rank', 'Start Pos', 'Subsequence', 'Score' ), "\n"; my $rank = 1; foreach my $start_pos ( sort { $scores{$b} <=> $scores{$a} } keys %sco +res ) { print join( "\t", $rank, $start_pos + 1, substr( $seq, $start_pos, $nmer_size ), $scores{$start_pos} ), "\n"; $rank++; } sub score_seq { my ( $seq ) = @_; # score the sequence according to the process described at # http://thr.cit.nih.gov/molbio/hla_bind/hla_motif_search_info.htm +l my $score = $final_const; foreach my $pos ( 0 .. length( $seq ) - 1 ) { # error checking should be added (invalid characters, etc) $score *= $mx{ substr( $seq, $pos, 1 ) }[$pos]; } return( sprintf( "%.3f", $score ) ); } __DATA__ A 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 C 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 1. +000 D 0.075 0.100 0.400 4.100 1.000 1.000 0.490 1.000 0. +003 E 0.075 1.400 0.064 4.100 1.000 1.000 0.490 1.000 0. +003 F 4.600 0.050 3.700 1.000 3.800 1.900 5.800 5.500 0. +015 G 1.000 0.470 1.000 1.000 1.000 1.000 0.130 1.000 0. +015 H 0.034 0.050 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 I 1.700 9.900 1.000 1.000 1.000 2.300 1.000 0.410 2. +100 K 3.500 0.100 0.035 1.000 1.000 1.000 1.000 1.000 0. +003 L 1.700 72.000 3.700 1.000 1.000 2.300 1.000 1.000 4. +300 M 1.700 52.000 3.700 1.000 1.000 2.300 1.000 1.000 1. +000 N 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 P 0.022 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 Q 1.000 7.300 1.000 1.000 1.000 1.000 1.000 1.000 0. +003 R 1.000 0.010 0.076 1.000 1.000 1.000 0.200 1.000 0. +003 S 1.000 0.470 1.000 1.000 1.000 1.000 1.000 1.000 0. +015 T 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1. +500 V 1.700 6.300 1.000 1.000 1.000 2.300 1.000 0.410 14 +.000 W 4.600 0.010 8.300 1.000 1.000 1.700 7.500 5.500 0. +015 Y 4.600 0.010 3.200 1.000 1.000 1.500 1.000 5.500 0. +015 [download] Here is the output - the scores are exactly the same as the ones I got when I used the website. Read more... (5 kB) It was very easy to implement this locally, as I expected, and I saved myself from having to write a whole bunch of code for the web submission and parsing. It is trivial to change this example to use a different scoring matrix - just copy and paste from the site (into the __DATA__ section). If I were doing this project, I'd probably put all of the scoring data for each matrix into a config-like file, then write a module to read it and score sequences, using the same parameters as the site. I hope this helps.	[reply] [d/l] [select]
Re: bioinformatics problem by sarani (Sexton) on May 28, 2006 at 06:13 UTC
Arvind, Like bobf I think you should generate your own program to do what BIMAS does, but from personal experience. That server crashed on me frequently this March when I was doing a project on it. It won't take much of your time to do that. If I can find the program my group wrote, I can put it here, but I rather doubt I'll be able to reach it anytime soon. On an aside, (but important, me thinks) if you are going to use BIMAS, then please use FASTA format, it's much safer. BIMAS does not read raw sequences less than a specific length (111 bps, I think). Your sample data has only 147. UPDATE: (reply to Arvind): It isn't too tough. :) and I'm still in college.	[reply]
Re^2: bioinformatics problem by mutatedgene (Novice) on May 29, 2006 at 08:21 UTC
Sarani, Nice to meet a person whos actually been thro this! thanks for the info!..like u said, am thinkin of doing the scoring locally.. hope that program wont be too tough.. im still at the beginner leevl... (maybe jus a lil better than a beginner:-)... btw.. wher r u working? Arvind	[reply]
Re: bioinformatics problem by xern (Beadle) on May 29, 2006 at 05:38 UTC
Hi, You may consider another solution for interacting with web site: FEAR::API. It is an integrated tool for site scraping. It can reduce your script size and save some of your keystrokes. Thank you.	[reply]
Re: bioinformatics problem by superfrink (Curate) on May 29, 2006 at 20:59 UTC
This does not answer your question but I just finished reading Beginning Perl for Bioinformatics by James Tisdall. If you are still pretty new to Perl and programming and have not read an introductary Perl book this one is probably a good bet. If you have been programming in other languages for a while or have been using Perl for a while then maybe another book would be better. You might find some suggestions over in If I could only own one Perl book, it would be:.	[reply]