rmgzsm9 has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of ID codes (UniProt codes), each representing a protein. There is a website called InterPro that provides protein-related information, and the URL for a protein's page contains that protein's code, so by changing the code in the URL I can get information about any protein. I wrote a program that retrieves the HTML page from the URL and then extracts the information written under the heading "protein family membership". I did this using HTML::TreeBuilder. The code is below:

use LWP::Simple;
use HTML::TreeBuilder;

my @ports=qw( P23141 P61177 P60725 P30542 P21817 P29274 Q07343 P08172 P20309 Q9GZZ6 );

for (my $i=0;$i < scalar(@ports);$i++) {
    my $url= "http://wwwdev.ebi.ac.uk/interpro/ISearch?query=".$ports[$i]."+";
    my $resp = get( $url );
    my $tree = HTML::TreeBuilder->new_from_content($resp);
    my $first=$tree->look_down(_tag => 'div',class => 'prot_fam');
    $first=$first->look_down(_tag => 'div',class => 'entry-parent');
    $first=$first->look_down(_tag => 'div',class => 'entry-parent');
    $first=$first->look_down(_tag => 'a');
    open (FH,">>result.txt");
    print FH $ports[$i].";";
    print FH $first->content_list;
    print FH "\n";
    close(FH);
}

Now the problem: this code works when a family name is present, that is, when 'prot_fam' is followed by 'entry-parent' (twice) in the HTML source of the page. However, when no family is defined, the structure of the page source is different: after the line with 'prot_fam' it says "No family membership assigned", and there is no 'entry-parent' in the following lines. When the script reaches a code in the list whose page says "No family membership assigned", it stops and never gets to the remaining codes. What I want is for the script to skip the entries with "No family membership assigned" and carry on with the rest. I am new to Perl. Please help me solve this problem. I will be grateful.
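
The skip described above might be sketched as follows. look_down returns undef when nothing matches, so each step can be checked before the next method call is made on its result. This is only a sketch under assumptions: the two HTML fragments are made-up stand-ins modelled on the structure described in the question (the real InterPro markup may differ), and the family_name helper and the WITH_FAMILY / NO_FAMILY keys are hypothetical names used for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up page fragments standing in for what get($url) would return.
# Assumption: they mirror the structure described in the question only.
my %pages = (
    WITH_FAMILY => '<div class="prot_fam"><div class="entry-parent">'
                 . '<div class="entry-parent"><a href="#">Example family</a></div>'
                 . '</div></div>',
    NO_FAMILY   => '<div class="prot_fam">No family membership assigned</div>',
);

# Hypothetical helper: returns the family name, or undef as soon as
# any step of the lookup finds nothing.
sub family_name {
    my ($html) = @_;
    return unless defined $html;    # e.g. the download failed

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $node = $tree->look_down(_tag => 'div', class => 'prot_fam');

    # Same chain of lookups as in the question, but stop at the first miss
    # instead of calling a method on undef.
    for my $step ([ _tag => 'div', class => 'entry-parent' ],
                  [ _tag => 'div', class => 'entry-parent' ],
                  [ _tag => 'a' ]) {
        last unless $node;
        $node = $node->look_down(@$step);
    }

    my $name = $node ? $node->as_text : undef;
    $tree->delete;                  # free the parse tree
    return $name;
}

my @lines;
for my $id (sort keys %pages) {
    my $family = family_name($pages{$id});
    next unless defined $family;    # skip entries with no family assigned
    push @lines, "$id;$family";
}
print "$_\n" for @lines;
```

In the real script the %pages lookup would be replaced by the get($url) call from the question, and the collected lines would be written to result.txt instead of printed.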


Re: Reading particular information from Html page and skipping the page that doesn't contain that information
by tobyink (Canon) on May 23, 2012 at 14:05 UTC

    I do wonder why you're trying to scrape HTML from this site when the life sciences community is actually very far ahead of the game when it comes to providing data in machine-readable formats.

    Taking one of the proteins from your list...

    use 5.010;
    use RDF::Trine;
    use RDF::QueryX::Lazy;

    my $protein = 'P61177';
    my $pr_url = sprintf 'http://purl.uniprot.org/uniprot/%s', $protein;

    my $database = RDF::Trine::Model->new;
    RDF::Trine::Parser->parse_url_into_model($pr_url => $database);

    my $query = RDF::QueryX::Lazy->new(<<"QUERY");
    SELECT *
    WHERE {
        <$pr_url> rdfs:seeAlso ?fam_url .
        ?fam_url rdfs:comment ?family .
        FILTER regex(STR(?fam_url), "^http://purl\.uniprot\.org/interpro/")
    }
    QUERY

    my $results = $query->execute($database);
    while (my $result = $results->next()) {
        say $result->{family}, q[ ], $result->{fam_url};
    }

    Running that produces the following output:

    "Ribosomal_L22_bac-type" <http://purl.uniprot.org/interpro/IPR005727>
    "Ribosomal_L22/L17_CS" <http://purl.uniprot.org/interpro/IPR018260>
    "Ribosomal_L22" <http://purl.uniprot.org/interpro/IPR001063>
    
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Thanks for the help. So are you saying there is another way to get that information from InterPro? I looked for one, and since I didn't find an easier option I started trying with Perl. If there is another way to retrieve the protein family information and match it with the UniProt codes, please let me know!

Re: Reading particular information from Html page and skipping the page that doesn't contain that information
by Anonymous Monk on May 23, 2012 at 13:28 UTC
    Use their API: http://www.ebi.ac.uk/Tools/webservices/, http://www.ebi.ac.uk/Tools/webservices/help/faq. Don't automate the web page, and respect the 'fair share' policy they operate.

      Excellent, Anonymous Monk. I applaud your encouraging respect for an institution's sharing policy.

        As an aside, “following such admonitions when they are made by an important data-provider” is not entirely altruistic:   in practice, it is a very important survival strategy.   When you design and build a business system that is intended to obtain information from a third-party data source, you obviously (should...) be designing that system to be durable.   You want to “do it right the first time such that you never have to re-visit it again,” not merely “to design an oh-so lokkit-me I’m-so-gosh-darned clever” ... hack.   You do not want to show up to work one day to find that the data provider made some slight change to their web-page last night and now thirteen hours (and counting) of vital production activity is scroo’d up with y-o-u-r asterisk-is-grass name on it.   Therefore, if a provider tells you the right way to do something, do it.

        (Trust me on this one ....)

Re: Reading particular information from Html page and skipping the page that doesn't contain that information
by daxim (Curate) on May 23, 2012 at 13:25 UTC