in reply to Reading particular information from Html page and skipping the page that doesn't contain that information

I do wonder why you're trying to scrape HTML from this site when the life sciences community is actually very far ahead of the game when it comes to providing data in machine-readable formats.

Taking one of the proteins from your list...

use 5.010;
use RDF::Trine;
use RDF::QueryX::Lazy;

my $protein = 'P61177';
my $pr_url  = sprintf 'http://purl.uniprot.org/uniprot/%s', $protein;

# Fetch the RDF description of the protein into an in-memory model.
my $database = RDF::Trine::Model->new;
RDF::Trine::Parser->parse_url_into_model($pr_url => $database);

# Select the rdfs:seeAlso links that point into the InterPro namespace,
# together with the family name attached to each of them.
my $query = RDF::QueryX::Lazy->new(<<"QUERY");
SELECT *
WHERE {
    <$pr_url> rdfs:seeAlso ?fam_url .
    ?fam_url rdfs:comment ?family .
    FILTER regex(STR(?fam_url), "^http://purl\.uniprot\.org/interpro/")
}
QUERY

my $results = $query->execute($database);
while (my $result = $results->next()) {
    say $result->{family}, q[ ], $result->{fam_url};
}

Running that produces the following output:

"Ribosomal_L22_bac-type" <http://purl.uniprot.org/interpro/IPR005727>
"Ribosomal_L22/L17_CS" <http://purl.uniprot.org/interpro/IPR018260>
"Ribosomal_L22" <http://purl.uniprot.org/interpro/IPR001063>
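
The FILTER in the query is just a prefix test on the rdfs:seeAlso URLs, so the "skip anything without family information" part of the original question can be sketched in plain Perl as well. This is only an illustration under stated assumptions: interpro_ids is a hypothetical helper, not part of RDF::Trine or any other module, and both the X00000 entry and the example.org URLs are made-up placeholders. The three InterPro URLs are the ones from the output above.

```perl
use strict;
use warnings;
use 5.010;

# Same namespace prefix as the FILTER in the SPARQL query,
# capturing the InterPro accession at the end of the URL.
my $INTERPRO = qr{^http://purl\.uniprot\.org/interpro/(IPR\d+)$};

# Hypothetical helper: return the InterPro accessions found in a list
# of URLs, in their original order; an empty list means "none found".
sub interpro_ids {
    return map { /$INTERPRO/ ? $1 : () } @_;
}

# Illustrative data only. The InterPro URLs come from the output above;
# X00000 and the example.org URLs are placeholders.
my %see_also = (
    P61177 => [
        'http://purl.uniprot.org/interpro/IPR005727',
        'http://purl.uniprot.org/interpro/IPR018260',
        'http://purl.uniprot.org/interpro/IPR001063',
        'http://example.org/other/ABC123',
    ],
    X00000 => [
        'http://example.org/other/DEF456',
    ],
);

for my $protein (sort keys %see_also) {
    my @ids = interpro_ids(@{ $see_also{$protein} });
    next unless @ids;    # skip entries with no InterPro link
    say "$protein: @ids";
}
```

With the data above, only P61177 is printed, followed by its three accessions; X00000 is skipped because none of its links fall under the InterPro prefix.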

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Re^2: Reading particular information from Html page and skipping the page that doesn't contain that information
by rmgzsm9 (Novice) on May 23, 2012 at 16:22 UTC

    Thanks for the help. So are you saying there is another way to get that information from InterPro? I looked for one, and since I didn't find an easier option I started trying with Perl. If there's another way to retrieve the protein family information and match it to the UniProt codes, please let me know!