I do wonder why you're trying to scrape HTML from this site when the life sciences community is actually very far ahead of the game when it comes to providing data in machine-readable formats.
Taking one of the proteins from your list...
use 5.010;
use RDF::Trine;
use RDF::QueryX::Lazy;

my $protein = 'P61177';
my $pr_url  = sprintf 'http://purl.uniprot.org/uniprot/%s', $protein;

# Fetch the protein's RDF description into an in-memory model
my $database = RDF::Trine::Model->new;
RDF::Trine::Parser->parse_url_into_model($pr_url => $database);

# Select the rdfs:seeAlso links that point at InterPro, with their comments
my $query = RDF::QueryX::Lazy->new(<<"QUERY");
SELECT *
WHERE {
    <$pr_url> rdfs:seeAlso ?fam_url .
    ?fam_url rdfs:comment ?family .
    FILTER regex(STR(?fam_url), "^http://purl\.uniprot\.org/interpro/")
}
QUERY

# Print each family name followed by its InterPro URL
my $results = $query->execute($database);
while (my $result = $results->next()) {
    say $result->{family}, q[ ], $result->{fam_url};
}
Running that produces the following output:
"Ribosomal_L22_bac-type" <http://purl.uniprot.org/interpro/IPR005727>
"Ribosomal_L22/L17_CS" <http://purl.uniprot.org/interpro/IPR018260>
"Ribosomal_L22" <http://purl.uniprot.org/interpro/IPR001063>
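The FILTER in the query keeps only those rdfs:seeAlso links whose URL starts with the InterPro prefix. If you'd rather do that test on the Perl side, and pull out the bare accession while you're at it, here is a rough sketch using core Perl only. The helper name interpro_id is my own invention, and the second URL in the list is made up purely to show a link being skipped; it is not taken from the real data.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: return the InterPro accession from a URL,
# or undef when the URL is not under the InterPro prefix.
sub interpro_id {
    my ($url) = @_;
    return $url =~ m{^http://purl\.uniprot\.org/interpro/(\w+)$} ? $1 : undef;
}

# Two URLs from the output above, plus one illustrative non-InterPro link.
my @urls = (
    'http://purl.uniprot.org/interpro/IPR005727',
    'http://example.org/not-interpro/XYZ',
    'http://purl.uniprot.org/interpro/IPR001063',
);

# Keep only the links that matched; anything else is silently skipped.
my @ids = grep { defined } map { interpro_id($_) } @urls;
say join ', ', @ids;    # prints "IPR005727, IPR001063"
```

A protein with no InterPro links simply yields an empty @ids, so you can skip it with a plain "next unless @ids;" inside a loop over your protein list.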
In reply to Re: Reading particular information from Html page and skipping the page that doesn't contain that information
by tobyink
in thread Reading particular information from Html page and skipping the page that doesn't contain that information
by rmgzsm9