smandape1 has asked for the wisdom of the Perl Monks concerning the following question:
Hello experts, I am trying to get data from the web using a web API. I am getting the data, but I only want a specific part of it: the identifiers that are PMIDs. The XML looks something like this:
    <Post rdf:about="http://www.connotea.org/user/lrlucena/uri/111b2eeb65471b9866c833929901564b">
      <title>The structure of scientific collaboration networks.</title>
      <updated>2007-11-10T14:38:49Z</updated>
      <uri>
        <dcterms:URI rdf:about="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum">
          <dc:title>Entrez PubMed</dc:title>
          <link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=11149952&query_hl=22&itool=pubmed_docsum</link>
          <hash>111b2eeb65471b9866c833929901564b</hash>
          <citation>
            <rdf:Description>
              <citationID>888849</citationID>
              <prism:title>From the Cover: The structure of scientific collaboration networks</prism:title>
              <dc:date>2001-01-16T00:00:00Z</dc:date>
              <journalID>212176</journalID>
              <prism:publicationName>Proc Natl Acad Sci U S A</prism:publicationName>
              <prism:endingPage>409</prism:endingPage>
              <doiResolver rdf:resource="http://dx.doi.org/10.1073/pnas.021544898"/>
              <dc:identifier>doi:10.1073/pnas.021544898</dc:identifier>
              <pmidResolver rdf:resource="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11149952"/>
              <dc:identifier>PMID: 11149952</dc:identifier>
            </rdf:Description>
          </citation>
        </dcterms:URI>
      </uri>
    </Post>
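As a side note, when only the raw RDF/XML text is at hand, the PMID could also be read straight from the `<dc:identifier>` elements. The following is a minimal sketch, assuming the markup always looks like the sample above. A regex is used only for brevity, and a real XML parser would be more robust. The helper name `pmids_in_xml` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the digits of every
# <dc:identifier>PMID: NNN</dc:identifier> element found in a string.
sub pmids_in_xml {
    my ($xml) = @_;
    # With /g in list context, the match returns every captured group.
    my @pmids = $xml =~ m{<dc:identifier>\s*PMID:\s*(\d+)\s*</dc:identifier>}g;
    return @pmids;
}

# Fragment taken from the sample post above.
my $xml = '<dc:identifier>doi:10.1073/pnas.021544898</dc:identifier>'
        . '<dc:identifier>PMID: 11149952</dc:identifier>';

print "PMID: $_\n" for pmids_in_xml($xml);
```

The DOI element is skipped because the pattern requires the `PMID:` prefix directly inside the element.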
I try to extract the identifiers with code that looks something like this:
    use lib 'C:/Perl64/www-connotea-perl-0.1/lib/';
    my $fn0="extracted-connotea-pubmedID-2.1.txt";
    open (IN0, $fn0) or die "Can't open $fn0: $!\n";
    open (FH, ">:utf8",'bookmarks_only_for_PubmedID.txt');
    use lib '../lib';
    use WWW::Connotea;
    my $currentURI;
    my $PMID;
    my $c = WWW::Connotea->new( user => 'myusername', password => '........' );
    $c->authenticate; ### dies if log-in credentials are incorrect
    while (<IN0>) {
        my $currentURI = $_;
        chomp($currentURI);
        my @tags = $c->posts_for(uri =>"$currentURI");
        die "No candidate related articles\n" unless @tags;
        print FH "$currentURI\n";
        foreach my $tag (@tags) {
            print FH "PMID: ";
            my $boo = $tag->bookmark();
            my $foo = $boo->citation();
            print FH $foo->identifiers(), "\n";
            my $bar = grep(/PMID:^/, $foo->identifiers());
            print FH $bar, "\n";
            # if ($foo->identifiers() =~ m/(PMID:^)^.(\d+^)^/)
            # {
            #     print FH "$2\n";
            # }
        }
    }
    close IN0;
    close FH;
My issue is not about getting the data. The source sometimes has only a PMID and sometimes has two identifiers, a DOI and a PMID. When I run this code and call identifiers(), I get both of them, and the output looks something like this:
    http://www.ncbi.nlm.nih.gov/pubmed/15754555
    PMID: PMID: 15754555
    http://www.ncbi.nlm.nih.gov/pubmed/4012367
    PMID: PMID: 4012367
    http://www.ncbi.nlm.nih.gov/pubmed/20215333
    PMID: doi:10.1093/fampra/cmq003PMID: 20215333
    http://www.ncbi.nlm.nih.gov/pubmed/20429974
    PMID: PMID: 20429974
    http://www.ncbi.nlm.nih.gov/pubmed/20338007
    PMID: doi:10.1111/j.1600-0838.2009.01081.xPMID: 20338007
    http://www.ncbi.nlm.nih.gov/pubmed/17438827
    PMID: PMID: 17438827
    http://www.ncbi.nlm.nih.gov/pubmed/17447555
    PMID: PMID: 17447555
    http://www.ncbi.nlm.nih.gov/pubmed/17450784
    PMID: PMID: 17450784
I want the output to contain only the PMIDs, something like this:
    http://www.ncbi.nlm.nih.gov/pubmed/15754555
    PMID: 15754555
    http://www.ncbi.nlm.nih.gov/pubmed/4012367
    PMID: 4012367
    http://www.ncbi.nlm.nih.gov/pubmed/20215333
    PMID: 20215333
    http://www.ncbi.nlm.nih.gov/pubmed/20429974
    PMID: 20429974
    http://www.ncbi.nlm.nih.gov/pubmed/20338007
    PMID: 20338007
    http://www.ncbi.nlm.nih.gov/pubmed/17438827
    PMID: 17438827
    http://www.ncbi.nlm.nih.gov/pubmed/17447555
    PMID: 17447555
    http://www.ncbi.nlm.nih.gov/pubmed/17450784
    PMID: 17450784
I am trying to use a regex, but I am not sure where I am going wrong. Experts, please help me.

Thank you,
Sammed
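Two things in the code above look like likely culprits. First, `grep` is assigned to a scalar, and in scalar context `grep` returns the number of matching elements, not the elements themselves. Second, the pattern `/PMID:^/` places the start-of-string anchor `^` after `PMID:`, so without the `/m` modifier it can never match. The following is a minimal sketch of the filtering step, assuming `identifiers()` returns a plain list of strings such as `doi:...` and `PMID: 12345`, as the output above suggests. The helper name `pmids_from` is hypothetical and not part of WWW::Connotea.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of identifier strings, return only
# the numeric part of those that look like "PMID: 12345".
sub pmids_from {
    my @identifiers = @_;
    # map yields the captured digits for each matching identifier and an
    # empty list for identifiers (such as DOIs) that do not match.
    my @pmids = map { /^\s*PMID:\s*(\d+)/ ? $1 : () } @identifiers;
    return @pmids;
}

# Sample data copied from the output shown above.
my @identifiers = ('doi:10.1093/fampra/cmq003', 'PMID: 20215333');

print "PMID: $_\n" for pmids_from(@identifiers);
```

Under the same assumption, the print statements inside the inner `foreach` loop could be replaced by something like `print FH "PMID: $_\n" for pmids_from($foo->identifiers());`, which writes one line per PMID and nothing for DOI entries.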
Replies are listed 'Best First'.
Re: Extracting web data
by zek152 (Pilgrim) on Jun 13, 2011 at 14:19 UTC
by smandape1 (Acolyte) on Jun 13, 2011 at 14:37 UTC
by zek152 (Pilgrim) on Jun 13, 2011 at 14:42 UTC
by Anonymous Monk on Jun 13, 2011 at 17:02 UTC
by zek152 (Pilgrim) on Jun 13, 2011 at 17:12 UTC