in reply to Reading particular information from Html page and skipping the page that doesn't contain that information
I do wonder why you're trying to scrape HTML from this site when the life sciences community is actually very far ahead of the game when it comes to providing data in machine-readable formats.
Taking one of the proteins from your list...
use 5.010;
use RDF::Trine;
use RDF::QueryX::Lazy;

my $protein = 'P61177';
my $pr_url  = sprintf 'http://purl.uniprot.org/uniprot/%s', $protein;

my $database = RDF::Trine::Model->new;
RDF::Trine::Parser->parse_url_into_model($pr_url => $database);

my $query = RDF::QueryX::Lazy->new(<<"QUERY");
SELECT *
WHERE {
    <$pr_url> rdfs:seeAlso ?fam_url .
    ?fam_url rdfs:comment ?family .
    FILTER regex(STR(?fam_url), "^http://purl\.uniprot\.org/interpro/")
}
QUERY

my $results = $query->execute($database);
while (my $result = $results->next()) {
    say $result->{family}, q[ ], $result->{fam_url};
}
Running that produces the following output:
"Ribosomal_L22_bac-type" <http://purl.uniprot.org/interpro/IPR005727>
"Ribosomal_L22/L17_CS" <http://purl.uniprot.org/interpro/IPR018260>
"Ribosomal_L22" <http://purl.uniprot.org/interpro/IPR001063>
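To run the same lookup over a whole list and skip the proteins that have no InterPro family, something along these lines might work. Treat it as an untested sketch: it reuses only the calls from the code above, the @proteins list and the $found counter are names I made up for illustration, and the only accession number filled in is the one from the example.

```perl
use 5.010;
use strict;
use warnings;
use RDF::Trine;
use RDF::QueryX::Lazy;

# Hypothetical list; only P61177 comes from the example above.
my @proteins = ('P61177');

for my $protein (@proteins) {
    my $pr_url   = sprintf 'http://purl.uniprot.org/uniprot/%s', $protein;
    my $database = RDF::Trine::Model->new;

    # Skip entries whose RDF cannot be fetched or parsed.
    my $loaded = eval {
        RDF::Trine::Parser->parse_url_into_model($pr_url => $database);
        1;
    };
    unless ($loaded) {
        warn "Skipping $protein: could not load $pr_url\n";
        next;
    }

    my $query = RDF::QueryX::Lazy->new(<<"QUERY");
SELECT *
WHERE {
    <$pr_url> rdfs:seeAlso ?fam_url .
    ?fam_url rdfs:comment ?family .
    FILTER regex(STR(?fam_url), "^http://purl\.uniprot\.org/interpro/")
}
QUERY

    # Count the rows so that entries without a family can be reported and skipped.
    my $results = $query->execute($database);
    my $found   = 0;
    while (my $result = $results->next()) {
        say $protein, q[ ], $result->{family}, q[ ], $result->{fam_url};
        $found++;
    }
    warn "Skipping $protein: no InterPro family found\n" unless $found;
}
```

Using a fresh model for each protein keeps the triples from one entry out of the next query, at the cost of one HTTP request per accession number.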
Re^2: Reading particular information from Html page and skipping the page that doesn't contain that information
by rmgzsm9 (Novice) on May 23, 2012 at 16:22 UTC
by marto (Cardinal) on May 23, 2012 at 16:28 UTC